Re: [PATCH 1/1 v2] KVM: MMU: Optimize guest page table walk

2011-04-25 Thread Jan Kiszka
On 2011-04-21 17:34, Takuya Yoshikawa wrote:
 From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 
 This patch optimizes the guest page table walk by using get_user()
 instead of copy_from_user().
 
 With this patch applied, paging64_walk_addr_generic() has become
 about 0.5us to 1.0us faster on my Phenom II machine with NPT on.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 ---
  arch/x86/kvm/paging_tmpl.h |   23 ---
  1 files changed, 20 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 74f8567..825d953 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -117,6 +117,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
   gva_t addr, u32 access)
  {
   pt_element_t pte;
 + pt_element_t __user *ptep_user;
   gfn_t table_gfn;
   unsigned index, pt_access, uninitialized_var(pte_access);
   gpa_t pte_gpa;
 @@ -152,6 +153,9 @@ walk:
   pt_access = ACC_ALL;
  
   for (;;) {
 + gfn_t real_gfn;
 + unsigned long host_addr;
 +
 	index = PT_INDEX(addr, walker->level);
  
   table_gfn = gpte_to_gfn(pte);
 @@ -160,9 +164,22 @@ walk:
 	walker->table_gfn[walker->level - 1] = table_gfn;
 	walker->pte_gpa[walker->level - 1] = pte_gpa;
  
 -	if (kvm_read_guest_page_mmu(vcpu, mmu, table_gfn, &pte,
 -				    offset, sizeof(pte),
 -				    PFERR_USER_MASK|PFERR_WRITE_MASK)) {
 +	real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 +				      PFERR_USER_MASK|PFERR_WRITE_MASK);
 + if (real_gfn == UNMAPPED_GVA) {
 + present = false;
 + break;
 + }
 + real_gfn = gpa_to_gfn(real_gfn);
 +
 +	host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
 + if (kvm_is_error_hva(host_addr)) {
 + present = false;
 + break;
 + }
 +
 + ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
 + if (get_user(pte, ptep_user)) {

This doesn't work for x86-32: pte is 64 bit, but get_user is only
defined up to 32 bit on that platform.

Avi, what's your 32-bit buildbot doing? :)

Jan





Re: A Live Backup feature for KVM

2011-04-25 Thread Jagane Sundar

Hello Stefan,

It's good to know that live snapshots and online backup are useful
functions.

I read through the two snapshot proposals that you pointed me at.

The direction that I chose to go is slightly different. In both of the
proposals you pointed me at, the original virtual disk is made
read-only and the VM writes to a different COW file. After backup
of the original virtual disk file is complete, the COW file is merged
with the original vdisk file.

Instead, I create an Original-Blocks-COW-file to store the original
blocks that are overwritten by the VM every time the VM performs
a write while the backup is in progress. Livebackup copies these
underlying blocks from the original virtual disk file before the VM's
write to the original virtual disk file is scheduled. The advantage of
this is that there is no merge necessary at the end of the backup; we
can simply delete the Original-Blocks-COW-file. (A sketch of this
copy-before-write step follows the sequence of operation below.)

I have some reasons to believe that the Original-Blocks-COW-file
design that I am putting forth might work better. I have listed them
below. (It's past midnight here, so pardon me if it sounds garbled -- I
will try to clarify more in a writeup on wiki.qemu.org).
Let me know what your thoughts are..

I feel that the livebackup mechanism will impact the running VM
less. For example, if something goes wrong with the backup process,
then we can simply delete the Original-Blocks-COW-file and force
the backup client to do a full backup the next time around. The
running VM or its virtual disks are not impacted at all.

Adjunct functionality such as block migration and live migration
might work easier with the Original-Blocks-COW-file way, since
the original virtual disk file functions as the only virtual disk
file for the VM. If a live migration needs to happen while a
backup is in progress, we can just delete the Original-Blocks-COW-file
and be on our way.

Livebackup includes a rudimentary network protocol to transfer
the modified blocks to a livebackup_client. It supports incremental
backups. Also, livebackup treats a backup as containing all the virtual
disks of a VM. Hence a snapshot in livebackup terms refers to a
snapshot of all the virtual disks.
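
For illustration only, a wire message in such a protocol could look
roughly like this (a purely hypothetical sketch; the actual format
lives in livebackup.[ch] and may differ):

    /* hypothetical livebackup wire command; all field names made up */
    struct livebackup_cmd {
        uint32_t magic;     /* protocol identifier */
        uint32_t cmd;       /* e.g. DO_SNAPSHOT, GET_DIRTY_BLOCKS,
                               DESTROY_SNAPSHOT */
        uint32_t vdisk;     /* which virtual disk of the backup set */
        uint64_t param;     /* e.g. starting block of a transfer */
    };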

The approximate sequence of operation is as follows:
1. VM boots up. When bdrv_open_common opens any file backed
virtual disk, it checks for a file called <base_file>.livebackupconf.
If such a file exists, then the virtual disk is part of the backup set,
and a chunk of memory is allocated to keep track of dirty blocks.
2. qemu starts up a livebackup thread that listens on a specified port
(e.g. port 7900) for connections from the livebackup client.
3. The livebackup_client connects to qemu at port 7900.
4. livebackup_client sends a 'do snapshot' command.
5. qemu waits 30 seconds for outstanding asynchronous I/O to complete.
6. When there are no more outstanding async I/O requests, qemu
copies the dirty_bitmap to its snapshot structure and starts a new
dirty bitmap.
7. livebackup_client starts iterating through the list of dirty blocks, and
starts saving these blocks to the backup image
8. When all blocks have been backed up, then the backup_client sends a
destroy snapshot command; the server simply deletes the
Original-Blocks-COW-files for each of the virtual disks and frees the
calloc'd memory holding the dirty blocks list.
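
To make the copy-before-write step above concrete, here is a minimal
sketch in C. All helper names (backup_in_progress, blocks_already_saved,
copy_to_cow_file, mark_blocks_dirty) are hypothetical placeholders, not
the actual API in livebackup.[ch]:

    /* Hypothetical sketch of livebackup's copy-before-write step:
     * before a guest write is submitted, save the blocks it will
     * overwrite into the Original-Blocks-COW-file and record them
     * in the dirty bitmap. */
    static int livebackup_before_write(BlockDriverState *bs,
                                       int64_t sector_num, int nb_sectors)
    {
        if (!backup_in_progress(bs)) {
            return 0;               /* no snapshot active, nothing to do */
        }
        if (!blocks_already_saved(bs, sector_num, nb_sectors)) {
            /* read the original contents and append them to the COW file */
            if (copy_to_cow_file(bs, sector_num, nb_sectors) < 0) {
                return -1;          /* abort only the backup, not the write */
            }
        }
        mark_blocks_dirty(bs, sector_num, nb_sectors);
        return 0;
    }

If anything fails here, only the backup is abandoned; the VM's own
write path is untouched, which is the low-impact property argued above.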

Thanks for the pointers to virtagent and fsfreeze. fsfreeze looks
exactly like what is necessary to quiesce file system activity.

I have pushed my code to the following git tree.
git://github.com/jagane/qemu-kvm-livebackup.git

It started as a clone of the linux kvm tree at:

git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git

If you want to look at the code, see livebackup.[ch] and livebackup_client.c

This is very much a work in progress, and I expect to do a lot of
testing/debugging over the next few weeks. I will also create a
detailed proposal on wiki.qemu.org, with much more information.

Thanks,
Jagane

On 4/24/2011 1:32 AM, Stefan Hajnoczi wrote:

On Sun, Apr 24, 2011 at 12:17 AM, Jagane Sundar jag...@sundar.org wrote:

I would like to get your input on a KVM feature that I am
currently developing.

What it does is this - it can perform full and incremental
disk backups of running KVM VMs, where a backup is defined
as a snapshot of the disk state of all virtual disks
configured for the VM.

Great, there is definitely demand for live snapshots and online
backup.  Some efforts are already underway to implement this.

Jes has worked on a live snapshot feature for online backups.  The
snapshot_blkdev QEMU monitor command is available in qemu.git and
works like this:
(qemu) snapshot_blkdev virtio-disk0 /tmp/new-img.qcow2

It will create a new image file backed by the current image file.  It
then switches the VM disk to the new image file.  All writes will go
to the new image file.  The backup software on the host can now read
from the original image file since it will no longer be modified.

Re: [PATCH 1/1 v2] KVM: MMU: Optimize guest page table walk

2011-04-25 Thread Takuya Yoshikawa
On Mon, 25 Apr 2011 10:04:43 +0200
Jan Kiszka jan.kis...@web.de wrote:

  +
  +   ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
  +   if (get_user(pte, ptep_user)) {
 
 This doesn't work for x86-32: pte is 64 bit, but get_user is only
 defined up to 32 bit on that platform.
 
 Avi, what's your 32-bit buildbot doing? :)
 
 Jan
 

Sorry, I did not test on x86_32.

Would introducing a wrapper function with an ifdef be the best way?


Takuya


Re: LEAVE emulation infinite loop

2011-04-25 Thread Matteo Signorini
Hi to All,

I fixed the previously highlighted error by calling the right pop
emulation function, but I still get the same error: an infinite leave
emulation loop.
IMHO this is not an emulation error, since x86_decode_insn and
x86_emulate_insn return a correct value (r = 0),
so I don't understand what I'm doing wrong...
Could you please give me a hint to fix it?

Thank you in advance,
Matteo

*** emulate.c   2011-04-21 13:19:11.535663092 +0200

--- myemulate.c 2011-04-21 13:34:21.490313650 +0200
*** static struct opcode opcode_table[256] =
*** 2504,2510 
       D(DstReg | SrcMemFAddr | ModRM | No64), D(DstReg | SrcMemFAddr
| ModRM | No64),
       G(ByteOp, group11), G(0, group11),
       /* 0xC8 - 0xCF */
!       N, N, N, D(ImplicitOps | Stack),
       D(ImplicitOps), D(SrcImmByte), D(ImplicitOps | No64),
D(ImplicitOps),
       /* 0xD0 - 0xD7 */
       D2bv(DstMem | SrcOne | ModRM), D2bv(DstMem | ModRM),
--- 2504,2510 
       D(DstReg | SrcMemFAddr | ModRM | No64), D(DstReg | SrcMemFAddr
| ModRM | No64),
       G(ByteOp, group11), G(0, group11),
       /* 0xC8 - 0xCF */
!       N, D(ImplicitOps | SrcNone), N, D(ImplicitOps | Stack),
       D(ImplicitOps), D(SrcImmByte), D(ImplicitOps | No64),
D(ImplicitOps),
       /* 0xD0 - 0xD7 */
       D2bv(DstMem | SrcOne | ModRM), D2bv(DstMem | ModRM),
*** special_insn:
*** 3259,3264 
--- 3259,3268 
       case 0xc5:              /* lds */
               rc = emulate_load_segment(ctxt, ops, VCPU_SREG_DS);
               break;
+       case 0xc9:              /* leave */
+               c->regs[VCPU_REGS_RSP] = c->regs[VCPU_REGS_RBP];
+               rc = emulate_pop(ctxt, ops, &c->regs[VCPU_REGS_RBP],
+                                c->op_bytes);
+               goto done;
       case 0xcb:              /* ret far */
               rc = emulate_ret_far(ctxt, ops);
               break;
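
A hedged guess at the cause of the loop (not verified against this
exact tree): "goto done" jumps straight to the exit path and skips the
writeback stage at the end of x86_emulate_insn(), which is what commits
the advanced c->eip back to the vcpu. If RIP never advances, the same
LEAVE byte is fetched and emulated forever. Under that assumption the
case should fall through to writeback like its neighbours:

        case 0xc9:              /* leave */
                c->regs[VCPU_REGS_RSP] = c->regs[VCPU_REGS_RBP];
                rc = emulate_pop(ctxt, ops, &c->regs[VCPU_REGS_RBP],
                                 c->op_bytes);
                break;  /* let the writeback path commit c->eip */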



2011/4/24 Avi Kivity a...@redhat.com

 On 04/24/2011 10:08 AM, Matteo Signorini wrote:

 Hello everybody,
 I have a problem with an opcode emulation not yet emulated in kvm-kmod
 2.6.38-rc7.
 The opcode is the LEAVE that as Intel Manual says:

 Set RSP to RBP, then pop RBP

 The problem is that despite the fact that the opcode of the leave
 (C9) is correctly fetched and decoded, it falls into an infinite loop
 (found by some printk debug prints).

 Now I'm wondering... the eip needed in order to continue the vm
 execution is advanced by the insn_fetch operation, so after the first
 byte of the LEAVE opcode is decoded I shouldn't execute it again... so
 what am I doing wrong?

 I posted here the diff output so you can see which changes I made on
 kvm original source code


         case 0xc5:              /* lds */
                 rc = emulate_load_segment(ctxt, ops, VCPU_SREG_DS);
                 break;
 +       case 0xc9:              /* leave */
 +               c->regs[VCPU_REGS_RSP] = c->regs[VCPU_REGS_RBP];
 +               rc = emulate_pop_sreg(ctxt, ops, VCPU_REGS_RBP);
 +               goto done;
         case 0xcb:              /* ret far */
                 rc = emulate_ret_far(ctxt, ops);
                 break;



 Why are you calling emulate_pop_sreg()? RBP is not a segment register.

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.



Re: [PATCH 1/1 v2] KVM: MMU: Optimize guest page table walk

2011-04-25 Thread Jan Kiszka
On 2011-04-25 10:32, Takuya Yoshikawa wrote:
 On Mon, 25 Apr 2011 10:04:43 +0200
 Jan Kiszka jan.kis...@web.de wrote:
 
 +
 +   ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
 +   if (get_user(pte, ptep_user)) {
 
 This doesn't work for x86-32: pte is 64 bit, but get_user is only
 defined up to 32 bit on that platform.

 Avi, what's your 32-bit buildbot doing? :)

 Jan

 
 Sorry, I did not test on x86_32.
 
 Would introducing a wrapper function with an ifdef be the best way?
 

Maybe you could also add the missing 64-bit get_user for x86-32. Given
that we have a corresponding put_user, I wonder why the get_user was
left out.
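
For illustration only, one shape the wrapper could take (a sketch; the
function name is made up, and it assumes the problematic instantiation
is the one where pt_element_t is 64 bits wide):

	static int FNAME(get_pte_user)(pt_element_t *pte,
				       pt_element_t __user *ptep_user)
	{
	#ifdef CONFIG_X86_64
		/* 64-bit get_user() exists here, use it directly */
		return get_user(*pte, ptep_user);
	#else
		/* x86-32 has no 64-bit get_user(); read a 64-bit (PAE)
		 * pte as two 32-bit halves, low half first */
		u32 __user *src = (u32 __user *)ptep_user;
		u32 lo, hi;

		if (get_user(lo, src) || get_user(hi, src + 1))
			return -EFAULT;
		*pte = ((u64)hi << 32) | lo;
		return 0;
	#endif
	}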

Jan





[PATCH 13/18] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

The event-tap function is called only when it is on.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 net.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 4f777c3..8bcc504 100644
--- a/net.c
+++ b/net.c
@@ -36,6 +36,7 @@
 #include "qemu_socket.h"
 #include "hw/qdev.h"
 #include "iov.h"
+#include "event-tap.h"
 
 static QTAILQ_HEAD(, VLANState) vlans;
 static QTAILQ_HEAD(, VLANClientState) non_vlan_clients;
@@ -518,6 +519,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender,
 
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
 {
+if (event_tap_is_on()) {
+return event_tap_send_packet(vc, buf, size);
+}
+
 qemu_send_packet_async(vc, buf, size, NULL);
 }
 
@@ -599,6 +604,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender,
 {
 NetQueue *queue;
 
+if (event_tap_is_on()) {
+return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb);
+}
+
     if (sender->link_down || (!sender->peer && !sender->vlan)) {
 return iov_size(iov, iovcnt);
 }
-- 
1.7.0.2



[PATCH 04/18] qemu-char: export socket_set_nodelay().

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 qemu-char.c   |2 +-
 qemu_socket.h |1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index 03858d4..fef33b0 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2115,7 +2115,7 @@ static void tcp_chr_telnet_init(int fd)
 send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_nodelay(int fd)
 {
     int val = 1;
     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
diff --git a/qemu_socket.h b/qemu_socket.h
index 180e4db..a05e1e5 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_nodelay(int fd);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.0.2



[PATCH 02/18] Introduce read() to FdMigrationState.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Currently FdMigrationState doesn't support read(), and this patch
introduces it to get a response from the other side.  Note that this
won't change the existing migration protocol into a bi-directional one.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 migration-tcp.c |   15 +++
 migration.c |   13 +
 migration.h |3 +++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index d3d80c9..bb67d53 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -38,6 +38,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size)
     return send(s->fd, buf, size, 0);
 }
 
+static int socket_read(FdMigrationState *s, const void * buf, size_t size)
+{
+    ssize_t len;
+
+    do {
+        len = recv(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+    if (len == -1) {
+        len = -socket_error();
+    }
+
+    return len;
+}
+
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
@@ -93,6 +107,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 
     s->get_error = socket_errno;
     s->write = socket_write;
+    s->read = socket_read;
     s->close = tcp_close;
     s->mig_state.cancel = migrate_fd_cancel;
     s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index af3a1f2..302b8fe 100644
--- a/migration.c
+++ b/migration.c
@@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
     return ret;
 }
 }
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size)
+{
+    FdMigrationState *s = opaque;
+    int ret;
+
+    ret = s->read(s, data, size);
+    if (ret == -1) {
+        ret = -(s->get_error(s));
+    }
+
+    return ret;
+}
+
 void migrate_fd_connect(FdMigrationState *s)
 {
 int ret;
diff --git a/migration.h b/migration.h
index 050c56c..6a76f77 100644
--- a/migration.h
+++ b/migration.h
@@ -48,6 +48,7 @@ struct FdMigrationState
 int (*get_error)(struct FdMigrationState*);
 int (*close)(struct FdMigrationState*);
 int (*write)(struct FdMigrationState*, const void *, size_t);
+int (*read)(struct FdMigrationState *, const void *, size_t);
 void *opaque;
 };
 
@@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque);
 
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size);
+
 void migrate_fd_connect(FdMigrationState *s);
 
 void migrate_fd_put_ready(void *opaque);
-- 
1.7.0.2



[PATCH 08/18] savevm: introduce util functions to control ft_trans_file from savevm layer.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

To utilize the ft_trans_file functions, savevm needs to export
several interfaces.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 hw/hw.h  |5 ++
 savevm.c |  150 ++
 2 files changed, 155 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index f90ff15..2d4d595 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_ft_trans_begin(QEMUFile *f);
+int qemu_ft_trans_commit(QEMUFile *f);
+int qemu_ft_trans_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
@@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f);
  * halted due to rate limiting or EAGAIN errors occur as it can be used to
  * resume output. */
 void qemu_file_put_notify(QEMUFile *f);
+void qemu_file_get_notify(void *opaque);
 
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
diff --git a/savevm.c b/savevm.c
index d017760..5b57e94 100644
--- a/savevm.c
+++ b/savevm.c
@@ -83,6 +83,7 @@
 #include "qemu_socket.h"
 #include "qemu-queue.h"
 #include "cpus.h"
+#include "ft_trans_file.h"
 
 #define SELF_ANNOUNCE_ROUNDS 5
 
@@ -190,6 +191,13 @@ typedef struct QEMUFileSocket
 QEMUFile *file;
 } QEMUFileSocket;
 
+typedef struct QEMUFileSocketTrans
+{
+int fd;
+QEMUFileSocket *s;
+VMChangeStateEntry *e;
+} QEMUFileSocketTrans;
+
 static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileSocket *s = opaque;
@@ -205,6 +213,22 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+    QEMUFileSocket *s = opaque;
+    ssize_t len;
+
+    do {
+        len = send(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+
+    if (len == -1) {
+        len = -socket_error();
+    }
+
+    return len;
+}
+
 static int socket_close(void *opaque)
 {
 QEMUFileSocket *s = opaque;
@@ -212,6 +236,71 @@ static int socket_close(void *opaque)
 return 0;
 }
 
+static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size)
+{
+    QEMUFileSocketTrans *t = opaque;
+    QEMUFileSocket *s = t->s;
+    ssize_t len;
+
+    len = socket_get_buffer(s, buf, pos, size);
+
+    return len;
+}
+
+static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size)
+{
+    QEMUFileSocketTrans *t = opaque;
+
+    return socket_put_buffer(t->s, buf, size);
+}
+
+static int qemu_loadvm_state_no_header(QEMUFile *f);
+
+static int socket_trans_get_ready(void *opaque)
+{
+    QEMUFileSocketTrans *t = opaque;
+    QEMUFileSocket *s = t->s;
+    QEMUFile *f = s->file;
+    int ret = 0;
+
+    ret = qemu_loadvm_state_no_header(f);
+    if (ret < 0) {
+        fprintf(stderr,
+                "socket_trans_get_ready: error while loading vmstate\n");
+    }
+
+    return ret;
+}
+
+static int socket_trans_close(void *opaque)
+{
+    QEMUFileSocketTrans *t = opaque;
+    QEMUFileSocket *s = t->s;
+
+    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+    qemu_set_fd_handler2(t->fd, NULL, NULL, NULL, NULL);
+    qemu_del_vm_change_state_handler(t->e);
+    close(s->fd);
+    close(t->fd);
+    qemu_free(s);
+    qemu_free(t);
+
+    return 0;
+}
+
+static void socket_trans_resume(void *opaque, int running, int reason)
+{
+    QEMUFileSocketTrans *t = opaque;
+    QEMUFileSocket *s = t->s;
+
+    if (!running) {
+        return;
+    }
+
+    qemu_announce_self();
+    qemu_fclose(s->file);
+}
+
 static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileStdio *s = opaque;
@@ -334,6 +423,26 @@ QEMUFile *qemu_fopen_socket(int fd)
 return s-file;
 }
 
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd)
+{
+    QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans));
+    QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+    t->s = s;
+    t->fd = s_fd;
+    t->e = qemu_add_vm_change_state_handler(socket_trans_resume, t);
+
+    s->fd = c_fd;
+    s->file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer,
+                                      socket_trans_get_buffer, NULL,
+  

[PATCH 01/18] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 hw/hw.h  |2 ++
 savevm.c |   20 +++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 1b09039..f90ff15 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index f4ff1a1..9cf0258 100644
--- a/savevm.c
+++ b/savevm.c
@@ -172,7 +172,8 @@ struct QEMUFile {
when reading */
 int buf_index;
 int buf_size; /* 0 when writing */
-uint8_t buf[IO_BUF_SIZE];
+int buf_max_size;
+uint8_t *buf;
 
 int has_error;
 };
@@ -423,6 +424,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
     f->get_rate_limit = get_rate_limit;
     f->is_write = 0;
 
+    f->buf_max_size = IO_BUF_SIZE;
+    f->buf = qemu_malloc(sizeof(uint8_t) * f->buf_max_size);
+
 return f;
 }
 
@@ -453,6 +457,19 @@ void qemu_fflush(QEMUFile *f)
 }
 }
 
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+    f->buf_max_size = size;
+    f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+    return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+    f->buf_size = f->buf_index = f->buf_offset = 0;
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
 int len;
@@ -478,6 +495,7 @@ int qemu_fclose(QEMUFile *f)
     qemu_fflush(f);
     if (f->close)
         ret = f->close(f->opaque);
+    qemu_free(f->buf);
     qemu_free(f);
 return ret;
 }
-- 
1.7.0.2



[PATCH 16/18] migration: introduce migrate_ft_trans_{put,get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Introduce migrate_ft_trans_put_ready(), which kicks the FT transaction
cycle.  When ft_mode is on, migrate_fd_put_ready() opens
ft_trans_file and turns on event_tap.  To end or cancel an FT
transaction, ft_mode and event_tap are turned off.
migrate_ft_trans_get_ready() is called to receive the ack from the
receiver.
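
As far as the code below shows, one transaction iterates roughly like
this (a summary added for orientation, not part of the original
message): FT_TRANSACTION_ATOMIC -> qemu_ft_trans_begin() /
qemu_savevm_trans_begin() -> FT_TRANSACTION_COMMIT -> vm_stop(),
qemu_savevm_trans_complete(), qemu_ft_trans_commit() ->
FT_TRANSACTION_RECV (wait for the receiver's ack) -> back to
FT_TRANSACTION_BEGIN or FT_TRANSACTION_ATOMIC after flushing event-tap.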

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 migration.c |  266 ++-
 1 files changed, 265 insertions(+), 1 deletions(-)

diff --git a/migration.c b/migration.c
index 1c2d956..d536df0 100644
--- a/migration.c
+++ b/migration.c
@@ -21,6 +21,7 @@
 #include "qemu_socket.h"
 #include "block-migration.h"
 #include "qemu-objects.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION
 
@@ -283,6 +284,17 @@ void migrate_fd_error(FdMigrationState *s)
 migrate_fd_cleanup(s);
 }
 
+static void migrate_ft_trans_error(FdMigrationState *s)
+{
+    ft_mode = FT_ERROR;
+    qemu_savevm_state_cancel(s->mon, s->file);
+    migrate_fd_error(s);
+    /* we need to set vm running to avoid assert in virtio-net */
+    vm_start();
+    event_tap_unregister();
+    vm_stop(0);
+}
+
 int migrate_fd_cleanup(FdMigrationState *s)
 {
 int ret = 0;
@@ -318,6 +330,17 @@ void migrate_fd_put_notify(void *opaque)
     qemu_file_put_notify(s->file);
 }
 
+static void migrate_fd_get_notify(void *opaque)
+{
+    FdMigrationState *s = opaque;
+
+    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+    qemu_file_get_notify(s->file);
+    if (qemu_file_has_error(s->file)) {
+        migrate_ft_trans_error(s);
+    }
+}
+
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 {
 FdMigrationState *s = opaque;
@@ -353,6 +376,10 @@ int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size)
         ret = -(s->get_error(s));
     }
 
+    if (ret == -EAGAIN) {
+        qemu_set_fd_handler2(s->fd, NULL, migrate_fd_get_notify, NULL, s);
+    }
+
 return ret;
 }
 
@@ -379,6 +406,230 @@ void migrate_fd_connect(FdMigrationState *s)
 migrate_fd_put_ready(s);
 }
 
+static int migrate_ft_trans_commit(void *opaque)
+{
+FdMigrationState *s = opaque;
+int ret = -1;
+
+    if (ft_mode != FT_TRANSACTION_COMMIT && ft_mode != FT_TRANSACTION_ATOMIC) {
+        fprintf(stderr,
+                "migrate_ft_trans_commit: invalid ft_mode %d\n", ft_mode);
+        goto out;
+    }
+
+    do {
+        if (ft_mode == FT_TRANSACTION_ATOMIC) {
+            if (qemu_ft_trans_begin(s->file) < 0) {
+                fprintf(stderr, "qemu_ft_trans_begin failed\n");
+                goto out;
+            }
+
+            ret = qemu_savevm_trans_begin(s->mon, s->file, 0);
+            if (ret < 0) {
+                fprintf(stderr, "qemu_savevm_trans_begin failed\n");
+                goto out;
+            }
+
+            ft_mode = FT_TRANSACTION_COMMIT;
+            if (ret) {
+                /* don't proceed if the fd isn't ready */
+                goto out;
+            }
+        }
+
+        /* make the VM state consistent by flushing outstanding events */
+        vm_stop(0);
+
+        /* send at full speed */
+        qemu_file_set_rate_limit(s->file, 0);
+
+        ret = qemu_savevm_trans_complete(s->mon, s->file);
+        if (ret < 0) {
+            fprintf(stderr, "qemu_savevm_trans_complete failed\n");
+            goto out;
+        }
+
+        ret = qemu_ft_trans_commit(s->file);
+        if (ret < 0) {
+            fprintf(stderr, "qemu_ft_trans_commit failed\n");
+            goto out;
+        }
+
+        if (ret) {
+            ft_mode = FT_TRANSACTION_RECV;
+            ret = 1;
+            goto out;
+        }
+
+        /* flush and check if events are remaining */
+        vm_start();
+        ret = event_tap_flush_one();
+        if (ret < 0) {
+            fprintf(stderr, "event_tap_flush_one failed\n");
+            goto out;
+        }
+
+        ft_mode = ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC;
+    } while (ft_mode != FT_TRANSACTION_BEGIN);
+
+    vm_start();
+    ret = 0;
+
+out:
+return ret;
+}
+
+static int migrate_ft_trans_get_ready(void *opaque)
+{
+FdMigrationState *s = opaque;
+int ret = -1;
+
+    if (ft_mode != FT_TRANSACTION_RECV) {
+        fprintf(stderr,
+                "migrate_ft_trans_get_ready: invalid ft_mode %d\n", ft_mode);
+        goto error_out;
+    }
+
+    /* flush and check if events are remaining */
+    vm_start();
+    ret = event_tap_flush_one();
+    if (ret < 0) {
+        fprintf(stderr, "event_tap_flush_one failed\n");
+        goto error_out;
+    }
+
+    if (ret) {
+        ft_mode = FT_TRANSACTION_BEGIN;
+    } else {
+        ft_mode = FT_TRANSACTION_ATOMIC;
+
+        ret = migrate_ft_trans_commit(s);
+        if (ret < 0) {
+            goto error_out;
+        }
+        if (ret) {
+            goto out;
+        }
+    }
+
+    vm_start();
+    ret = 0;
+   

[PATCH 03/18] Introduce qemu_loadvm_state_no_header() and make qemu_loadvm_state() a wrapper.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Introduce qemu_loadvm_state_no_header() so that it can be called
iteratively without reading the header, and qemu_loadvm_state()
becomes a wrapper of it.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 savevm.c |   45 +++--
 1 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/savevm.c b/savevm.c
index 9cf0258..d017760 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1744,30 +1744,14 @@ typedef struct LoadStateEntry {
 int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+static int qemu_loadvm_state_no_header(QEMUFile *f)
 {
 QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
 QLIST_HEAD_INITIALIZER(loadvm_handlers);
 LoadStateEntry *le, *new_le;
 uint8_t section_type;
-unsigned int v;
-int ret;
-
-if (qemu_savevm_state_blocked(default_mon)) {
-return -EINVAL;
-}
-
-v = qemu_get_be32(f);
-if (v != QEMU_VM_FILE_MAGIC)
-return -EINVAL;
 
-v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
-    }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;
+int ret;
 
 while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
 uint32_t instance_id, version_id, section_id;
@@ -1862,6 +1846,31 @@ out:
 return ret;
 }
 
+int qemu_loadvm_state(QEMUFile *f)
+{
+unsigned int v;
+
+if (qemu_savevm_state_blocked(default_mon)) {
+return -EINVAL;
+}
+
+v = qemu_get_be32(f);
+if (v != QEMU_VM_FILE_MAGIC) {
+return -EINVAL;
+}
+
+v = qemu_get_be32(f);
+    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
+        return -ENOTSUP;
+    }
+if (v != QEMU_VM_FILE_VERSION) {
+return -ENOTSUP;
+}
+
+return qemu_loadvm_state_no_header(f);
+}
+
 static int bdrv_snapshot_find(BlockDriverState *bs, QEMUSnapshotInfo *sn_info,
   const char *name)
 {
-- 
1.7.0.2



[PATCH 06/18] virtio: decrement last_avail_idx with inuse before saving.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

For regular migration, inuse == 0 always, as requests are flushed
before save.  However, the event-tap log, when enabled, introduces an
extra queue for requests which is not being flushed, thus the last
inuse requests are left in the event-tap queue.  Move the
last_avail_idx value sent to the remote back to make it repeat the
last inuse requests.
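
For a concrete (illustrative) example: if last_avail_idx is 10 and two
requests are still queued in event-tap (inuse == 2), the value saved is
8, so after failover the secondary re-reads and re-processes those two
requests instead of losing them.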

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/virtio.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 6e8814c..d342e25 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -672,12 +672,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, i);
 
     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+/* For regular migration inuse == 0 always as
+ * requests are flushed before save. However,
+ * event-tap log when enabled introduces an extra
+ * queue for requests which is not being flushed,
+ * thus the last inuse requests are left in the event-tap queue.
+ * Move the last_avail_idx value sent to the remote back
+ * to make it repeat the last inuse requests. */
+        uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
         if (vdev->vq[i].vring.num == 0)
             break;
 
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
-        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_put_be16s(f, &last_avail);
         if (vdev->binding->save_queue)
             vdev->binding->save_queue(vdev->binding_opaque, i, f);
 }
-- 
1.7.0.2



[PATCH 05/18] vl.c: add deleted flag for deleting the handler.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Make deleting handlers robust against deletion of any elements in a
handler by using a deleted flag like in file descriptors.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 vl.c |   15 ++-
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/vl.c b/vl.c
index 68c3b53..a18792d 100644
--- a/vl.c
+++ b/vl.c
@@ -1096,6 +1096,7 @@ static void nographic_update(void *opaque)
 struct vm_change_state_entry {
 VMChangeStateHandler *cb;
 void *opaque;
+int deleted;
 QLIST_ENTRY (vm_change_state_entry) entries;
 };
 
@@ -1116,18 +1117,22 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
 
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
 {
-    QLIST_REMOVE (e, entries);
-    qemu_free (e);
+    e->deleted = 1;
 }
 
 void vm_state_notify(int running, int reason)
 {
-VMChangeStateEntry *e;
+VMChangeStateEntry *e, *ne;
 
 trace_vm_state_notify(running, reason);
 
-    for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
-        e->cb(e->opaque, running, reason);
+    QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) {
+        if (e->deleted) {
+            QLIST_REMOVE(e, entries);
+            qemu_free(e);
+        } else {
+            e->cb(e->opaque, running, reason);
+        }
 }
 }
 
-- 
1.7.0.2



[PATCH 18/18] Introduce kemari: to enable FT migration mode (Kemari).

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

When "kemari:" is prepended to the URI of the migrate command, it will
turn on ft_mode to start FT migration mode (Kemari).  On the receiver
side, the option looks like: -incoming kemari:protocol:address:port
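
For illustration, with made-up host/port values, the two sides would
then be started like this (based on the description above):

    # receiver
    qemu ... -incoming kemari:tcp:0.0.0.0:4444

    # sender, from the QEMU monitor
    (qemu) migrate -d kemari:tcp:receiver-host:4444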

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Acked-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 hmp-commands.hx |4 +++-
 migration.c |   12 
 qmp-commands.hx |4 +++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 834e6a8..4cd7bfa 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -760,7 +760,9 @@ ETEXI
                      "\n\t\t\t -b for migration without shared storage with"
                      " full copy of disk\n\t\t\t -i for migration without "
                      "shared storage with incremental copy of disk "
-                     "(base image shared between src and destination)",
+                     "(base image shared between src and destination)"
+                     "\n\t\t\t put \"kemari:\" in front of URI to enable "
+                     "Fault Tolerance mode (Kemari protocol)",
 .user_print = monitor_user_noop,   
.mhandler.cmd_new = do_migrate,
 },
diff --git a/migration.c b/migration.c
index d536df0..5017dea 100644
--- a/migration.c
+++ b/migration.c
@@ -48,6 +48,12 @@ int qemu_start_incoming_migration(const char *uri)
 const char *p;
 int ret;
 
+/* check ft_mode (Kemari protocol) */
+    if (strstart(uri, "kemari:", &p)) {
+ft_mode = FT_INIT;
+uri = p;
+}
+
     if (strstart(uri, "tcp:", &p))
 ret = tcp_start_incoming_migration(p);
 #if !defined(WIN32)
@@ -99,6 +105,12 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject 
**ret_data)
 return -1;
 }
 
+/* check ft_mode (Kemari protocol) */
+    if (strstart(uri, "kemari:", &p)) {
+ft_mode = FT_INIT;
+uri = p;
+}
+
     if (strstart(uri, "tcp:", &p)) {
 s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
  blk, inc);
diff --git a/qmp-commands.hx b/qmp-commands.hx
index fbd98ee..71e4f0e 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -437,7 +437,9 @@ EQMP
                      "\n\t\t\t -b for migration without shared storage with"
                      " full copy of disk\n\t\t\t -i for migration without "
                      "shared storage with incremental copy of disk "
-                     "(base image shared between src and destination)",
+                     "(base image shared between src and destination)"
+                     "\n\t\t\t put \"kemari:\" in front of URI to enable "
+                     "Fault Tolerance mode (Kemari protocol)",
 .user_print = monitor_user_noop,   
.mhandler.cmd_new = do_migrate,
 },
-- 
1.7.0.2



[PATCH 15/18] savevm: introduce qemu_savevm_trans_{begin,commit}.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Introduce qemu_savevm_trans_{begin,commit} to send the memory and
device info together, while avoiding cancelling memory state tracking.
This patch also abstracts common code between
qemu_savevm_state_{begin,iterate,commit}.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 savevm.c |  157 +++---
 sysemu.h |2 +
 2 files changed, 101 insertions(+), 58 deletions(-)

diff --git a/savevm.c b/savevm.c
index 5b57e94..dfbdc6c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1630,29 +1630,68 @@ bool qemu_savevm_state_blocked(Monitor *mon)
 return false;
 }
 
-int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-int shared)
+/*
+ * section: header to write
+ * inc: if true, forces to pass SECTION_PART instead of SECTION_START
+ * pause: if true, breaks the loop when live handler returned 0
+ */
+static int qemu_savevm_state_live(Monitor *mon, QEMUFile *f, int section,
+  bool inc, bool pause)
 {
 SaveStateEntry *se;
+int skip = 0, ret;
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
-        if (se->set_params == NULL) {
+        int len, stage;
+
+        if (se->save_live_state == NULL) {
             continue;
-        }
-        se->set_params(blk_enable, shared, se->opaque);
+        }
+
+        /* Section type */
+        qemu_put_byte(f, section);
+        qemu_put_be32(f, se->section_id);
+
+        if (section == QEMU_VM_SECTION_START) {
+            /* ID string */
+            len = strlen(se->idstr);
+            qemu_put_byte(f, len);
+            qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+            qemu_put_be32(f, se->instance_id);
+            qemu_put_be32(f, se->version_id);
+
+            stage = inc ? QEMU_VM_SECTION_PART : QEMU_VM_SECTION_START;
+        } else {
+            assert(inc);
+            stage = section;
+        }
+
+        ret = se->save_live_state(mon, f, stage, se->opaque);
+        if (!ret) {
+            skip++;
+            if (pause) {
+                break;
+            }
+        }
     }
-
-    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
-    qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+    return skip;
+}
+
+static void qemu_savevm_state_full(QEMUFile *f)
+{
+SaveStateEntry *se;
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         int len;
 
-        if (se->save_live_state == NULL)
+        if (se->save_state == NULL && se->vmsd == NULL) {
             continue;
+        }
 
         /* Section type */
-        qemu_put_byte(f, QEMU_VM_SECTION_START);
+        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
         qemu_put_be32(f, se->section_id);
 
         /* ID string */
@@ -1663,9 +1702,29 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
         qemu_put_be32(f, se->instance_id);
         qemu_put_be32(f, se->version_id);
 
-        se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+        vmstate_save(f, se);
+    }
+
+qemu_put_byte(f, QEMU_VM_EOF);
+}
+
+int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
+int shared)
+{
+SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        if (se->set_params == NULL) {
+            continue;
+        }
+        se->set_params(blk_enable, shared, se->opaque);
 }
 
+qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_START, 0, 0);
+
 if (qemu_file_has_error(f)) {
 qemu_savevm_state_cancel(mon, f);
 return -EIO;
@@ -1676,29 +1735,16 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, 
int blk_enable,
 
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
 {
-SaveStateEntry *se;
 int ret = 1;
 
-    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
-        if (se->save_live_state == NULL)
-            continue;
-
-        /* Section type */
-        qemu_put_byte(f, QEMU_VM_SECTION_PART);
-        qemu_put_be32(f, se->section_id);
-
-        ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
-if (!ret) {
-/* Do not proceed to the next vmstate before this one reported
-   completion of the current stage. This serializes the migration
-   and reduces the probability that a faster changing state is
-   synchronized over and over again. */
-break;
-}
-}
-
-if (ret)
+/* Do not proceed to the next vmstate before this one reported
+   completion of the current stage. This serializes the migration
+   and reduces the probability that a faster changing state is
+   synchronized over and over again. */
+ret = qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_PART, 1, 1);
+if (!ret) {
 return 1;
+}
 
 if 

[PATCH 10/18] Call init handler of event-tap at main() in vl.c.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 vl.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index a18792d..2dbda4d 100644
--- a/vl.c
+++ b/vl.c
@@ -160,6 +160,7 @@ int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
+#include "event-tap.h"
 
 #include "ui/qemu-spice.h"
 
@@ -2974,6 +2975,8 @@ int main(int argc, char **argv, char **envp)
 
 blk_mig_init();
 
+event_tap_init();
+
 /* open the virtual block devices */
     if (snapshot)
         qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot,
                           NULL, 0);
-- 
1.7.0.2



[PATCH 11/18] ioport: insert event_tap_ioport() to ioport_write().

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Record ioport event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 ioport.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index 2e971fa..f485bab 100644
--- a/ioport.c
+++ b/ioport.c
@@ -27,6 +27,7 @@
 
 #include "ioport.h"
 #include "trace.h"
+#include "event-tap.h"
 
 /***/
 /* IO Port */
@@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
 default_ioport_writel
 };
 IOPortWriteFunc *func = ioport_write_table[index][address];
+event_tap_ioport(index, address, data);
 if (!func)
 func = default_func[index];
 func(ioport_opaque[address], address, data);
-- 
1.7.0.2



[PATCH 14/18] block: insert event-tap to bdrv_aio_writev(), bdrv_aio_flush() and bdrv_flush().

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

The event-tap function is called only when it is on and the requests
were sent from device emulators.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Acked-by: Kevin Wolf kw...@redhat.com
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 block.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index f731c7a..9e6b610 100644
--- a/block.c
+++ b/block.c
@@ -28,6 +28,7 @@
 #include block_int.h
 #include module.h
 #include qemu-objects.h
+#include event-tap.h
 
 #ifdef CONFIG_BSD
 #include sys/types.h
@@ -1591,6 +1592,10 @@ int bdrv_flush(BlockDriverState *bs)
 }
 
     if (bs->drv && bs->drv->bdrv_flush) {
+        if (*bs->device_name && event_tap_is_on()) {
+            event_tap_bdrv_flush();
+        }
+
         return bs->drv->bdrv_flush(bs);
 }
 
@@ -2226,6 +2231,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
     if (bdrv_check_request(bs, sector_num, nb_sectors))
         return NULL;
 
+    if (*bs->device_name && event_tap_is_on()) {
+        return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
+                                         cb, opaque);
+    }
+
     if (bs->dirty_bitmap) {
 blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
  opaque);
@@ -2499,6 +2509,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
 
 if (!drv)
 return NULL;
+
+    if (*bs->device_name && event_tap_is_on()) {
+        return event_tap_bdrv_aio_flush(bs, cb, opaque);
+    }
+
     return drv->bdrv_aio_flush(bs, cb, opaque);
 }
 
-- 
1.7.0.2



[PATCH 09/18] Introduce event-tap.

2011-04-25 Thread OHMURA Kei
event-tap controls when to start an FT transaction, and provides proxy
functions to be called from net/block devices.  During an FT
transaction, it queues up net/block requests and flushes them when the
transaction completes.

Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 Makefile.target |1 +
 event-tap.c |  940 +++
 event-tap.h |   44 +++
 qemu-tool.c |   28 ++
 trace-events|   10 +
 5 files changed, 1023 insertions(+), 0 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h

diff --git a/Makefile.target b/Makefile.target
index 0e0ef36..e489df4 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,7 @@ obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_NO_KVM) += kvm-stub.o
 LIBS+=-lz
+obj-y += event-tap.o
 
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
 QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
diff --git a/event-tap.c b/event-tap.c
new file mode 100644
index 000..95c147a
--- /dev/null
+++ b/event-tap.c
@@ -0,0 +1,940 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block.h"
+#include "block_int.h"
+#include "ioport.h"
+#include "osdep.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "net.h"
+#include "event-tap.h"
+#include "trace.h"
+
+enum EVENT_TAP_STATE {
+EVENT_TAP_OFF,
+EVENT_TAP_ON,
+EVENT_TAP_SUSPEND,
+EVENT_TAP_FLUSH,
+EVENT_TAP_LOAD,
+EVENT_TAP_REPLAY,
+};
+
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+
+typedef struct EventTapIOport {
+uint32_t address;
+uint32_t data;
+int  index;
+} EventTapIOport;
+
+#define MMIO_BUF_SIZE 8
+
+typedef struct EventTapMMIO {
+uint64_t address;
+uint8_t  buf[MMIO_BUF_SIZE];
+int  len;
+} EventTapMMIO;
+
+typedef struct EventTapNetReq {
+char *device_name;
+int iovcnt;
+int vlan_id;
+bool vlan_needed;
+bool async;
+struct iovec *iov;
+NetPacketSent *sent_cb;
+} EventTapNetReq;
+
+#define MAX_BLOCK_REQUEST 32
+
+typedef struct EventTapAIOCB EventTapAIOCB;
+
+typedef struct EventTapBlkReq {
+char *device_name;
+int num_reqs;
+int num_cbs;
+bool is_flush;
+BlockRequest reqs[MAX_BLOCK_REQUEST];
+EventTapAIOCB *acb[MAX_BLOCK_REQUEST];
+} EventTapBlkReq;
+
+#define EVENT_TAP_IOPORT (1 << 0)
+#define EVENT_TAP_MMIO   (1 << 1)
+#define EVENT_TAP_NET    (1 << 2)
+#define EVENT_TAP_BLK    (1 << 3)
+
+#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1)
+
+typedef struct EventTapLog {
+int mode;
+union {
+EventTapIOport ioport;
+EventTapMMIO mmio;
+};
+union {
+EventTapNetReq net_req;
+EventTapBlkReq blk_req;
+};
+QTAILQ_ENTRY(EventTapLog) node;
+} EventTapLog;
+
+struct EventTapAIOCB {
+BlockDriverAIOCB common;
+BlockDriverAIOCB *acb;
+bool is_canceled;
+};
+
+static EventTapLog *last_event_tap;
+
+static QTAILQ_HEAD(, EventTapLog) event_list;
+static QTAILQ_HEAD(, EventTapLog) event_pool;
+
+static int (*event_tap_cb)(void);
+static QEMUBH *event_tap_bh;
+static VMChangeStateEntry *vmstate;
+
+static void event_tap_bh_cb(void *p)
+{
+if (event_tap_cb) {
+event_tap_cb();
+}
+
+qemu_bh_delete(event_tap_bh);
+event_tap_bh = NULL;
+}
+
+static void event_tap_schedule_bh(void)
+{
+trace_event_tap_ignore_bh(!!event_tap_bh);
+
+/* if bh is already set, we ignore it for now */
+if (event_tap_bh) {
+return;
+}
+
+event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL);
+qemu_bh_schedule(event_tap_bh);
+
+return;
+}
+
+static void *event_tap_alloc_log(void)
+{
+EventTapLog *log;
+
+    if (QTAILQ_EMPTY(&event_pool)) {
+        log = qemu_mallocz(sizeof(EventTapLog));
+    } else {
+        log = QTAILQ_FIRST(&event_pool);
+        QTAILQ_REMOVE(&event_pool, log, node);
+    }
+
+return log;
+}
+
+static void event_tap_free_net_req(EventTapNetReq *net_req);
+static void event_tap_free_blk_req(EventTapBlkReq *blk_req);
+
+static void event_tap_free_log(EventTapLog *log)
+{
+    int mode = log->mode & ~EVENT_TAP_TYPE_MASK;
+
+    if (mode == EVENT_TAP_NET) {
+        event_tap_free_net_req(&log->net_req);
+    } else if (mode == EVENT_TAP_BLK) {
+        event_tap_free_blk_req(&log->blk_req);
+    }
+
+    log->mode = 0;
+
+    /* return the log to event_pool */
+    QTAILQ_INSERT_HEAD(&event_pool, log, node);
+}
+
+static void event_tap_free_pool(void)
+{
+EventTapLog *log, *next;
+
+    QTAILQ_FOREACH_SAFE(log, &event_pool, node, next) {
+        QTAILQ_REMOVE(&event_pool, log, node);
+        qemu_free(log);
+}
+}
+
+static void event_tap_free_net_req(EventTapNetReq *net_req)
+{
+int i;
+
+    if (!net_req->async) {
+for 

[PATCH 12/18] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

Record mmio write event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 exec.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index c3dc68a..3c3cece 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,7 @@
 #include "osdep.h"
 #include "kvm.h"
 #include "qemu-timer.h"
+#include "event-tap.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
 #include <signal.h>
@@ -3736,6 +3737,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
     io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
     if (p)
         addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+    event_tap_mmio(addr, buf, len);
+
     /* XXX: could force cpu_single_env to NULL to avoid
        potential bugs */
     if (l >= 4 && ((addr1 & 3) == 0)) {
-- 
1.7.0.2



[PATCH 00/18] Kemari for KVM v0.2.14

2011-04-25 Thread OHMURA Kei
Hi,

This patch series is a revised version of Kemari for KVM. The current 
code is based on qemu.git ec52b8753a372de30b22d9b4765a799db612.

The changes from v0.2.13 -> v0.2.14 are:

- rebased to latest.
- correct patch[07], [09] author.

The changes from v0.2.12 -> v0.2.13 are:

- replaced qemu_get_timer() with qemu_get_timer_ns()
- check s->file before calling qemu_ft_trans_cancel()
- avoid virtio-net assert upon calling event_tap_unregister()

The changes from v0.2.11 -> v0.2.12 are:

- fix vm_state_notify() to use QLIST_FOREACH_SAFE (Juan)
- introduce qemu_loadvm_state_no_header() and refactored
 qemu_loadvm_state() to call it after checking headers (Juan)

The changes from v0.2.10 -> v0.2.11 are:

- rebased to 0.14
- upon unregistering event-tap, set event_tap_state after event_tap_flush
- modify commit log of 02/18 that it won't make existing migration
 bi-directional.

The changes from v0.2.9 -> v0.2.10 are:

- change migrate format to kemari:protocol:host:port (Paolo)

The changes from v0.2.8 -> v0.2.9 are:

- abstract common code between qemu_savevm_{state,trans}_* (Paolo)
- change incoming format to kemari:protocol:host:port (Paolo)

The changes from v0.2.7 -> v0.2.8 are:

- fixed calling wrong cb in event-tap
- add missing qemu_aio_release in event-tap

The changes from v0.2.6 -> v0.2.7 are:

- add AIOCB, AIOPool and cancel functions (Kevin)
- insert event-tap for bdrv_flush (Kevin)
- add error handing when calling bdrv functions (Kevin)
- fix usage of qemu_aio_flush and bdrv_flush (Kevin)
- use bs in AIOCB on the primary (Kevin)
- reorder event-tap functions to gather with block/net (Kevin)
- fix checking bs-device_name (Kevin)

The changes from v0.2.5 -> v0.2.6 are:

- use qemu_{put,get}_be32() to save/load niov in event-tap

The changes from v0.2.4 -> v0.2.5 are:

- fixed braces and trailing spaces by using Blue's checkpatch.pl (Blue)
- event-tap: don't try to send blk_req if it's a bdrv_aio_flush event

The changes from v0.2.3 -> v0.2.4 are:

- call vm_start() before event_tap_flush_one() to avoid failure in
 virtio-net assertion
- add vm_change_state_handler to turn off ft_mode
- use qemu_iovec functions in event-tap
- remove duplicated code in migration
- remove unnecessary new line for error_report in ft_trans_file

The changes from v0.2.2 -> v0.2.3 are:

- queue async net requests without copying (MST)
-- if not async, contents of the packets are sent to the secondary
- better description for option -k (MST)
- fix memory transfer failure
- fix ft transaction initiation failure

The changes from v0.2.1 -> v0.2.2 are:

- decrement last_avail_idx with inuse before saving (MST)
- remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit()

The changes from v0.2 -> v0.2.1 are:

- Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin)
- Tap bdrv_aio_flush (Marcelo)
- Remove multiwrite interface in event-tap (Stefan)
- Fix event-tap to use pio/mmio to replay both net/block (Stefan)
- Improve error handling in event-tap (Stefan)
- Fix leak in event-tap (Stefan)
- Revise virtio last_avail_idx manipulation (MST)
- Clean up migration.c hook (Marcelo)
- Make deleting change state handler robust (Isaku, Anthony)

The changes from v0.1.1 -> v0.2 are:

- Introduce a queue in event-tap to make VM sync live.
- Change transaction receiver to a state machine for async receiving.
- Replace net/block layer functions with event-tap proxy functions.
- Remove dirty bitmap optimization for now.
- convert DPRINTF() in ft_trans_file to trace functions.
- convert fprintf() in ft_trans_file to error_report().
- improved error handling in ft_trans_file.
- add a tmp pointer to qemu_del_vm_change_state_handler.

The changes from v0.1 -> v0.1.1 are:

- events are tapped in net/block layer instead of device emulation layer.
- Introduce a new option for -incoming to accept FT transaction.

- Removed writev() support to QEMUFile and FdMigrationState for now.
 I would post this work in a different series.

- Modified virtio-blk save/load handler to send inuse variable to
 correctly replay.

- Removed configure --enable-ft-mode.
- Removed unnecessary check for qemu_realloc().

The first 6 patches modify several functions of qemu to prepare
introducing Kemari specific components.

The next 6 patches are the components of Kemari.  They introduce
event-tap and the FT transaction protocol file based on buffered file.
The design document of FT transaction protocol can be found at,
http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf

Then the following 2 patches modify net/block layer functions with
event-tap functions.  Please note that if Kemari is off, event-tap
will just pass through, and there is almost no intrusion to existing
functions including normal live migration.

Finally, the migration layer are modified to support Kemari in the
last 4 patches.  Again, there shouldn't be any affection if a user
doesn't specify Kemari specific options.  The transaction is now async
on both sender 

[PATCH 07/18] Introduce fault tolerant VM transaction QEMUFile and ft_mode.

2011-04-25 Thread OHMURA Kei
This code implements the VM transaction protocol.  Like buffered_file,
it sits between the savevm and migration layers.  With this
architecture, the VM transaction protocol is implemented mostly
independently from other existing code.
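
Schematically (an orientation note, not part of the patch): savevm
layer <-> ft_trans_file (transaction framing and buffering) <->
migration transport (e.g. the TCP socket).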

Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 Makefile.objs   |1 +
 ft_trans_file.c |  624 +++
 ft_trans_file.h |   72 +++
 migration.c |3 +
 trace-events|   15 ++
 5 files changed, 715 insertions(+), 0 deletions(-)
 create mode 100644 ft_trans_file.c
 create mode 100644 ft_trans_file.h

diff --git a/Makefile.objs b/Makefile.objs
index 44ce368..75e7c79 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -101,6 +101,7 @@ common-obj-y += qdev.o qdev-properties.o
 common-obj-y += block-migration.o iohandler.o
 common-obj-y += pflib.o
 common-obj-y += bitmap.o bitops.o
+common-obj-y += ft_trans_file.o
 
 common-obj-$(CONFIG_BRLAPI) += baum.o
 common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_trans_file.c b/ft_trans_file.c
new file mode 100644
index 0000000..2b42b95
--- /dev/null
+++ b/ft_trans_file.c
@@ -0,0 +1,624 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori  aligu...@us.ibm.com
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "trace.h"
+#include "ft_trans_file.h"
+
+typedef struct FtTransHdr
+{
+    uint16_t cmd;
+    uint16_t id;
+    uint32_t seq;
+    uint32_t payload_len;
+} FtTransHdr;
+
+typedef struct QEMUFileFtTrans
+{
+    FtTransPutBufferFunc *put_buffer;
+    FtTransGetBufferFunc *get_buffer;
+    FtTransPutReadyFunc *put_ready;
+    FtTransGetReadyFunc *get_ready;
+    FtTransWaitForUnfreezeFunc *wait_for_unfreeze;
+    FtTransCloseFunc *close;
+    void *opaque;
+    QEMUFile *file;
+
+    enum QEMU_VM_TRANSACTION_STATE state;
+    uint32_t seq;
+    uint16_t id;
+
+    int has_error;
+
+    bool freeze_output;
+    bool freeze_input;
+    bool rate_limit;
+    bool is_sender;
+    bool is_payload;
+
+    uint8_t *buf;
+    size_t buf_max_size;
+    size_t put_offset;
+    size_t get_offset;
+
+    FtTransHdr header;
+    size_t header_offset;
+} QEMUFileFtTrans;
+
+#define IO_BUF_SIZE 32768
+
+static void ft_trans_append(QEMUFileFtTrans *s,
+                            const uint8_t *buf, size_t size)
+{
+    if (size > (s->buf_max_size - s->put_offset)) {
+        trace_ft_trans_realloc(s->buf_max_size, size + 1024);
+        s->buf_max_size += size + 1024;
+        s->buf = qemu_realloc(s->buf, s->buf_max_size);
+    }
+
+    trace_ft_trans_append(size);
+    memcpy(s->buf + s->put_offset, buf, size);
+    s->put_offset += size;
+}
+
+static void ft_trans_flush(QEMUFileFtTrans *s)
+{
+    size_t offset = 0;
+
+    if (s->has_error) {
+        error_report("flush when error %d, bailing", s->has_error);
+        return;
+    }
+
+    while (offset < s->put_offset) {
+        ssize_t ret;
+
+        ret = s->put_buffer(s->opaque, s->buf + offset, s->put_offset - offset);
+        if (ret == -EAGAIN) {
+            break;
+        }
+
+        if (ret <= 0) {
+            error_report("error flushing data, %s", strerror(errno));
+            s->has_error = FT_TRANS_ERR_FLUSH;
+            break;
+        } else {
+            offset += ret;
+        }
+    }
+
+    trace_ft_trans_flush(offset, s->put_offset);
+    memmove(s->buf, s->buf + offset, s->put_offset - offset);
+    s->put_offset -= offset;
+    s->freeze_output = !!s->put_offset;
+}
+
+static ssize_t ft_trans_put(void *opaque, void *buf, int size)
+{
+    QEMUFileFtTrans *s = opaque;
+    size_t offset = 0;
+    ssize_t len;
+
+    /* flush buffered data before putting next */
+    if (s->put_offset) {
+        ft_trans_flush(s);
+    }
+
+    while (!s->freeze_output && offset < size) {
+        len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+        if (len == -EAGAIN) {
+            trace_ft_trans_freeze_output();
+            s->freeze_output = 1;
+            break;
+        }
+
+        if (len <= 0) {
+            error_report("putting data failed, %s", strerror(errno));
+            s->has_error = 1;
+            offset = -EINVAL;
+            break;
+        }
+
+        offset += len;
+    }
+
+    if (s->freeze_output) {
+        ft_trans_append(s, buf + offset, size - offset);
+        offset = size;
+    }
+
+    return offset;
+}
+
+static int ft_trans_send_header(QEMUFileFtTrans *s,
+                                enum QEMU_VM_TRANSACTION_STATE state,
+                                uint32_t payload_len)
+{
+    int ret;
+  

[PATCH 17/18] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.

2011-04-25 Thread OHMURA Kei
From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

When ft_mode is set in the header, tcp_accept_incoming_migration()
sets ft_trans_incoming() as a callback, and calls
qemu_file_get_notify() to receive the FT transaction iteratively.  We
also need a hack not to close the fd before moving to ft_transaction
mode, so that we can reuse the fd for it.  A vm_change_state_handler
is added to turn off ft_mode when cont is pressed.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 migration-tcp.c |   68 ++-
 1 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index bb67d53..1eeac2b 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -17,6 +17,9 @@
 #include "qemu-char.h"
 #include "buffered_file.h"
 #include "block.h"
+#include "sysemu.h"
+#include "ft_trans_file.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION_TCP
 
@@ -28,6 +31,8 @@
 do { } while (0)
 #endif
 
+static VMChangeStateEntry *vmstate;
+
 static int socket_errno(FdMigrationState *s)
 {
 return socket_error();
@@ -55,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, size_t size)
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
-    if (s->fd != -1) {
+    /* FIX ME: accessing ft_mode here isn't clean */
+    if (s->fd != -1 && ft_mode != FT_INIT) {
         close(s->fd);
         s->fd = -1;
     }
@@ -149,6 +155,36 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 return s-mig_state;
 }
 
+static void ft_trans_incoming(void *opaque)
+{
+    QEMUFile *f = opaque;
+
+    qemu_file_get_notify(f);
+    if (qemu_file_has_error(f)) {
+        ft_mode = FT_ERROR;
+        qemu_fclose(f);
+    }
+}
+
+static void ft_trans_reset(void *opaque, int running, int reason)
+{
+    QEMUFile *f = opaque;
+
+    if (running) {
+        if (ft_mode != FT_ERROR) {
+            qemu_fclose(f);
+        }
+        ft_mode = FT_OFF;
+        qemu_del_vm_change_state_handler(vmstate);
+    }
+}
+
+static void ft_trans_schedule_replay(QEMUFile *f)
+{
+    event_tap_schedule_replay();
+    vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f);
+}
+
 static void tcp_accept_incoming_migration(void *opaque)
 {
     struct sockaddr_in addr;
@@ -174,8 +210,38 @@ static void tcp_accept_incoming_migration(void *opaque)
         goto out;
     }
 
+    if (ft_mode == FT_INIT) {
+        autostart = 0;
+    }
+
     process_incoming_migration(f);
+
+    if (ft_mode == FT_INIT) {
+        int ret;
+
+        socket_set_nodelay(c);
+
+        f = qemu_fopen_ft_trans(s, c);
+        if (f == NULL) {
+            fprintf(stderr, "could not qemu_fopen_ft_trans\n");
+            goto out;
+        }
+
+        /* need to wait sender to setup */
+        ret = qemu_ft_trans_begin(f);
+        if (ret < 0) {
+            goto out;
+        }
+
+        qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f);
+        ft_trans_schedule_replay(f);
+        ft_mode = FT_TRANSACTION_RECV;
+
+        return;
+    }
+
     qemu_fclose(f);
+
 out:
     close(c);
 out2:
-- 
1.7.0.2



Connecting to a new guest VM

2011-04-25 Thread Emmanuel Noobadmin
This is probably a very noob question but I haven't been able to find
a solution that worked so far. Maybe it's just something really minor
that I've missed so I'll appreciate some pointers.

Running on Scientific Linux 6, bridged networking configured with
ifcfg-br0 and ifcfg-eth0, networking is working, I can ssh/vnc into
the host.

I created a guest using the following command as root following the
virt-install man page.

virt-install -n vm_01 -r 640 --vcpus=1
--file=/home/VMs/vm110401/vm_01_d1 -s 170 --nonsparse
--network=bridge:br0  --accelerate
--cdrom=/home/ISO/CentOS-5.6-x86_64-bin-DVD-1of2.iso --os-type=linux
--os-variant=rhel5

It seems to work, except I get a line that says "Escape Char is ^]"
And the console doesn't react to any further input except to exit.
Then it warns me that the OS is still being installed.

Being a noob, I figured maybe a GUI will be easier. So I installed X
desktop and created another VM with the same parameters except I added
--vnc --vncport=15901

However, I cannot connect to the VM, whether using the public IP or
through the LAN IP.

I have the vnc port allowed in iptables, the port is not the default
5901 since I already have the external VNC listening on that port.

I've also tried to connect to the VM via 127.0.0.1 through my VNC
session but depending on what I try (public, LAN, vnc from within vnc
to localhost) I get either a "connection refused" or a "write: broken
pipe" error.

Based on some google searches, I've also edited qemu.conf to include
the line vnc_listen = "0.0.0.0"

But still no joy, and from googling, apparently I'm not the only noob
who has gotten stuck here. So I'd appreciate it greatly if somebody could
point out what's missing or wrong, thanks!


Re: [PATCH 12/18] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.

2011-04-25 Thread Jan Kiszka
On 2011-04-25 13:00, OHMURA Kei wrote:
 From: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 
 Record mmio write event to replay it upon failover.
 
 Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
 ---
  exec.c |4 
  1 files changed, 4 insertions(+), 0 deletions(-)
 
 diff --git a/exec.c b/exec.c
 index c3dc68a..3c3cece 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -33,6 +33,7 @@
  #include "osdep.h"
  #include "kvm.h"
  #include "qemu-timer.h"
 +#include "event-tap.h"
  #if defined(CONFIG_USER_ONLY)
  #include "qemu.h"
  #include "signal.h"
 @@ -3736,6 +3737,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
      io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
      if (p)
          addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
 +
 +    event_tap_mmio(addr, buf, len);
 +

You know that this is incomplete? A few devices are calling st*_phys
directly, specifically virtio.

What kind of mmio should be traced here, device or CPU originated? Or both?

Jan
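
To make the gap concrete (editorial sketch, not part of the patch): virtio
updates guest memory through the st*_phys helpers, which never pass through
cpu_physical_memory_rw(), so the event_tap_mmio() hook above cannot see
those writes.  Roughly, modelled on hw/virtio.c of this era (names
approximate):

/* the used-ring index is updated with a direct physical store,
 * bypassing cpu_physical_memory_rw() and thus the event-tap hook */
static inline void vring_used_idx_increase(VirtQueue *vq, int val)
{
    target_phys_addr_t pa;

    pa = vq->vring.used + offsetof(VRingUsed, idx);
    stw_phys(pa, vring_used_idx(vq) + val);    /* not tapped */
}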





Re: A Live Backup feature for KVM

2011-04-25 Thread Stefan Hajnoczi
On Mon, Apr 25, 2011 at 9:16 AM, Jagane Sundar jag...@sundar.org wrote:
 The direction that I chose to go is slightly different. In both of the
 proposals you pointed me at, the original virtual disk is made
 read-only and the VM writes to a different COW file. After backup
 of the original virtual disk file is complete, the COW file is merged
 with the original vdisk file.

 Instead, I create an Original-Blocks-COW-file to store the original
 blocks that are overwritten by the VM everytime the VM performs
 a write while the backup is in progress. Livebackup copies these
 underlying blocks from the original virtual disk file before the VM's
 write to the original virtual disk file is scheduled. The advantage of
 this is that there is no merge necessary at the end of the backup, we
 can simply delete the Original-Blocks-COW-file.

The advantage of the approach that redirects writes to a new file
instead is that the heavy work of copying data is done asynchronously
during the merge operation instead of in the write path which will
impact guest performance.

Here's what I understand:

1. User takes a snapshot of the disk, QEMU creates old-disk.img backed
by the current-disk.img.
2. Guest issues a write A.
3. QEMU reads B from current-disk.img.
4. QEMU writes B to old-disk.img.
5. QEMU writes A to current-disk.img.
6. Guest receives write completion A.

The tricky thing is what happens if there is a failure after Step 5.
If writes A and B were unstable writes (no fsync()) then no ordering
is guaranteed and perhaps write A reached current-disk.img but write B
did not reach old-disk.img.  In this case we no longer have a
consistent old-disk.img snapshot - we're left with an updated
current-disk.img and old-disk.img does not have a copy of the old
data.

The solution is to fsync() after Step 4 and before Step 5 but this
will hurt performance.  We now have an extra read, write, and fsync()
on every write.
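
To make that ordering concrete, here is a minimal hand-written sketch of
steps 3-5 with the extra flush; the helper and its callers are hypothetical,
and only the bdrv_* calls are meant to be the stock QEMU block-layer API:

/* 'bs' is current-disk.img, 'cow_bs' is old-disk.img; both are open
 * BlockDriverState handles.  Assumes bdrv_* return negative on error. */
static int backup_cow_write(BlockDriverState *bs, BlockDriverState *cow_bs,
                            int64_t offset, void *new_data, int len)
{
    void *old_data = qemu_malloc(len);
    int ret;

    ret = bdrv_pread(bs, offset, old_data, len);       /* step 3 */
    if (ret < 0) {
        goto out;
    }
    ret = bdrv_pwrite(cow_bs, offset, old_data, len);  /* step 4 */
    if (ret < 0) {
        goto out;
    }
    /* make step 4 durable before step 5, otherwise a crash can leave
     * current-disk.img updated with no copy of the old data */
    ret = bdrv_flush(cow_bs);
    if (ret < 0) {
        goto out;
    }
    ret = bdrv_pwrite(bs, offset, new_data, len);      /* step 5 */
out:
    qemu_free(old_data);
    return ret;
}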

 I have some reasons to believe that the Original-Blocks-COW-file
 design that I am putting forth might work better. I have listed them
 below. (It's past midnight here, so pardon me if it sounds garbled -- I
 will try to clarify more in a writeup on wiki.qemu.org).
 Let me know what your thoughts are..

 I feel that the livebackup mechanism will impact the running VM
 less. For example, if something goes wrong with the backup process,
 then we can simply delete the Original-Blocks-COW-file and force
 the backup client to do a full backup the next time around. The
 running VM or its virtual disks are not impacted at all.

Abandoning snapshots is not okay.  Snapshots will be used in scenarios
beyond backup and I don't think we can make them
unreliable/throw-away.

 Livebackup includes a rudimentary network protocol to transfer
 the modified blocks to a livebackup_client. It supports incremental
 backups. Also, livebackup treats a backup as containing all the virtual
 disks of a VM. Hence a snapshot in livebackup terms refer to a
 snapshot of all the virtual disks.

 The approximate sequence of operation is as follows:
 1. VM boots up. When bdrv_open_common opens any file backed
    virtual disk, it checks for a file called base_file.livebackupconf.
    If such a file exists, then the virtual disk is part of the backup set,
    and a chunk of memory is allocated to keep track of dirty blocks.
 2. qemu starts up a  livebackup thread that listens on a specified port
    (e.g) port 7900, for connections from the livebackup client.
 3. The livebackup_client connects to qemu at port 7900.
 4. livebackup_client sends a 'do snapshot' command.
 5. qemu waits 30 seconds for outstanding asynchronous I/O to complete.
 6. When there are no more outstanding async I/O requests, qemu
    copies the dirty_bitmap to its snapshot structure and starts a new dirty
    bitmap.
 7. livebackup_client starts iterating through the list of dirty blocks, and
    starts saving these blocks to the backup image
 8. When all blocks have been backed up, then the backup_client sends a
    destroy snapshot command; the server simply deletes the
    Original-Blocks-COW-files for each of the virtual disks and frees the
    calloc'd memory holding the dirty blocks list.
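
A sketch of the bookkeeping that steps 1 and 6 imply (illustrative only;
the names and the 64KB tracking granularity are invented):

#define LB_BLOCK_SIZE 65536             /* assumed tracking granularity */

typedef struct LivebackupBitmap {
    unsigned long *bits;                /* one bit per LB_BLOCK_SIZE chunk */
    int64_t nb_blocks;
} LivebackupBitmap;

/* called from the write path of a disk in the backup set (step 1) */
static void lb_set_dirty(LivebackupBitmap *bm, int64_t byte_offset)
{
    int64_t blk = byte_offset / LB_BLOCK_SIZE;

    bm->bits[blk / BITS_PER_LONG] |= 1UL << (blk % BITS_PER_LONG);
}

/* 'do snapshot' (step 6): hand the current bitmap to the snapshot and
 * start a fresh one for writes that arrive during the backup */
static unsigned long *lb_freeze_bitmap(LivebackupBitmap *bm)
{
    unsigned long *frozen = bm->bits;
    size_t words = (bm->nb_blocks + BITS_PER_LONG - 1) / BITS_PER_LONG;

    bm->bits = qemu_mallocz(words * sizeof(unsigned long));
    return frozen;
}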

I think there's a benefit to just pointing at
Original-Blocks-COW-files and letting the client access it directly.
This even works with shared storage where the actual backup work is
performed on another host via access to a shared network filesystem or
LUN.  It may not be desirable to send everything over the network.


Perhaps you made a custom network client because you are writing a
full-blown backup solution for KVM?  In that case it's your job to
move the data around and get it backed up.  But from QEMU's point of
view we just need to provide the data and it's up to the backup
software to send it over the network and do its magic.

 I have pushed my code to the following git tree.
 git://github.com/jagane/qemu-kvm-livebackup.git

 It started as a clone of the linux kvm tree 

Re: [Qemu-devel] [PATCH 2/2 V7] qemu,qmp: add inject-nmi qmp command

2011-04-25 Thread Luiz Capitulino
On Wed, 20 Apr 2011 09:53:56 +0800
Lai Jiangshan la...@cn.fujitsu.com wrote:

 On 04/04/2011 09:09 PM, Anthony Liguori wrote:
  On 04/04/2011 07:19 AM, Markus Armbruster wrote:
  [Note cc: Anthony]
 
  Daniel P. Berrange berra...@redhat.com  writes:
 
  On Mon, Mar 07, 2011 at 05:46:28PM +0800, Lai Jiangshan wrote:
  From: Lai Jiangshan la...@cn.fujitsu.com
  Date: Mon, 7 Mar 2011 17:05:15 +0800
  Subject: [PATCH 2/2] qemu,qmp: add inject-nmi qmp command
 
  inject-nmi command injects an NMI on all CPUs of the guest.
  It is only supported for x86 guests currently; it will
  return an Unsupported error for non-x86 guests.
 
  ---
hmp-commands.hx |2 +-
monitor.c   |   18 +-
qmp-commands.hx |   29 +
3 files changed, 47 insertions(+), 2 deletions(-)
  Does anyone have any feedback on this addition, or are all new
  QMP patch proposals blocked pending Anthony's QAPI work?
  That would be bad.  Anthony, what's holding this back?
  
  It doesn't pass checkpath.pl.
  
  But I'd also expect this to come through Luiz's QMP tree.
  
  Regards,
  
  Anthony Liguori
  
 
 Hi, Anthony,
 
 I cannot find checkpath.pl in the source tree.

It's ./scripts/checkpatch.pl

 And how/where should I write the error descriptions? Is the following
 description suitable?
 
 ##
 # @inject-nmi:
 #
 # Inject an NMI on the guest.
 #
 # Returns: Nothing on success.
 #  If the guest(non-x86) does not support NMI injection, Unsupported
 #
 # Since: 0.15.0
 ##
 { 'command': 'inject-nmi' }
 
 
 Thanks,
 Lai
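
For reference, the wire exchange for this command would look roughly like
the following (hand-typed illustration, not captured from a running build):

{ "execute": "inject-nmi" }
{ "return": {} }

and on a guest without NMI injection support the reply would instead be an
error carrying the Unsupported class.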
 



pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern
Running qemu-kvm.git as of today (ffce28f, April 18, 2011) the virtual
function passed to the VM is losing its assigned mac address. That is,
prior to launching qemu-kvm, the following command is run to set the MAC
address:

ip link set dev eth2 vf 0 mac 02:12:34:56:79:20

Yet, when the VM boots the MAC address is random which is what happens
when the VF is reset. Looking through the commit logs between 0.13.0 --
the version in Fedora 14 -- and latest git I found the following:

commit d9488459ff2ab113293586c1c36b1679bb15deee
Author: Alex Williamson alex.william...@redhat.com
Date:   Thu Mar 17 15:24:31 2011 -0600

device-assignment: Reset device on system reset

On system reset, we currently try to quiesce DMA by clearing the
command register.  This assumes that nothing re-enables bus master
support without first de-programming the device.  Use a bigger
hammer to help the guest not shoot itself by issuing a function
reset via sysfs on each system reset.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Acked-by: Chris Wright chr...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


Is this the cause of the MAC address reset and is this behavior intended?

David


Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 10:28 -0600, David Ahern wrote:
 Running qemu-kvm.git as of today (ffce28f, April 18, 2011) the virtual
 function passed to the VM is losing its assigned mac address. That is,
 prior to launching qemu-kvm, the following command is run to set the MAC
 address:
 
 ip link set dev eth2 vf 0 mac 02:12:34:56:79:20
 
 Yet, when the VM boots the MAC address is random which is what happens
 when the VF is reset. Looking through the commit logs between 0.13.0 --
 the version in Fedora 14 -- and latest git I found the following:
 
 commit d9488459ff2ab113293586c1c36b1679bb15deee
 Author: Alex Williamson alex.william...@redhat.com
 Date:   Thu Mar 17 15:24:31 2011 -0600
 
 device-assignment: Reset device on system reset
 
 On system reset, we currently try to quiesce DMA by clearing the
 command register.  This assumes that nothing re-enables bus master
 support without first de-programming the device.  Use a bigger
 hammer to help the guest not shoot itself by issuing a function
 reset via sysfs on each system reset.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Acked-by: Chris Wright chr...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
 Is this the cause of the MAC address reset and is this behavior intended?

Ugh, I hope not, it's certainly not an intended side effect.  Can you
see if the problem still happens if you revert this patch?  If it does,
we might need more device specific reset functions to save and restore
that extra bit of state.  I assume this is still the 82576 VF you were
asking about before?  Thanks,

Alex




Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern


On 04/25/11 10:37, Alex Williamson wrote:
 On Mon, 2011-04-25 at 10:28 -0600, David Ahern wrote:
 Running qemu-kvm.git as of today (ffce28f, April 18, 2011) the virtual
 function passed to the VM is losing its assigned mac address. That is,
 prior to launching qemu-kvm, the following command is run to set the MAC
 address:

 ip link set dev eth2 vf 0 mac 02:12:34:56:79:20

 Yet, when the VM boots the MAC address is random which is what happens
 when the VF is reset. Looking through the commit logs between 0.13.0 --
 the version in Fedora 14 -- and latest git I found the following:

 commit d9488459ff2ab113293586c1c36b1679bb15deee
 Author: Alex Williamson alex.william...@redhat.com
 Date:   Thu Mar 17 15:24:31 2011 -0600

 device-assignment: Reset device on system reset

 On system reset, we currently try to quiesce DMA by clearing the
 command register.  This assumes that nothing re-enables bus master
 support without first de-programming the device.  Use a bigger
 hammer to help the guest not shoot itself by issuing a function
 reset via sysfs on each system reset.

 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Acked-by: Chris Wright chr...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


 Is this the cause of the MAC address reset and is this behavior intended?
 
 Ugh, I hope not, it's certainly not an intended side effect.  Can you
 see if the problem still happens if you revert this patch?  If it does,

I commented out the write() in the reset function and indeed the mac
address was not reset on VM boot.

 we might need more device specific reset functions to save and restore
 that extra bit of state.  I assume this is still the 82576 VF you were
 asking about before?  Thanks,

Yes. I got distracted end of last week. Response to that thread coming soon.

David


 
 Alex
 
 


Re: [PATCH 0/3] qemu-kvm: pci-assign: Mapping fixes

2011-04-25 Thread Alex Williamson
On Sat, 2011-04-23 at 12:05 +0200, Jan Kiszka wrote:
 The promised cleanups.
 
 
 
 Jan Kiszka (3):
   qemu-kvm: pci-assign: Clean up free_assigned_device
   qemu-kvm: pci-assign: Remove dead code from assigned_dev_iomem_map
   qemu-kvm: pci-assign: Consolidate and fix slow mmio region mappings
 
  hw/device-assignment.c |  139 ++-
  1 files changed, 53 insertions(+), 86 deletions(-)
 

Looks good to me.

Acked-by: Alex Williamson alex.william...@redhat.com



Re: [Qemu-devel] [RFC PATCH 0/3 V8] QAPI: add inject-nmi qmp command

2011-04-25 Thread Michael Roth

On 04/20/2011 01:19 AM, Lai Jiangshan wrote:



These patches apply on top of http://repo.or.cz/r/qemu/aliguori.git, glib branch.

These patches add QAPI inject-nmi. They pass checkpatch.pl and the build.

But the resulting qemu executable is not tested, because a qemu built from
http://repo.or.cz/r/qemu/aliguori.git glib doesn't work on my box.


What issues are you seeing using the binary from the glib tree? AFAIK 
that tree should be functional, except potentially with TCG. I've only 
been using it with KVM and --enable-io-thread though, so I don't know for sure.




Lai Jiangshan (3):
   QError: Introduce QERR_UNSUPPORTED
   qapi,nmi: add inject-nmi qmp command
   qapi-hmp: Convert HMP nmi to use QMP

  hmp-commands.hx  |   18 --
  hmp.c|   12 
  hmp.h|1 +
  monitor.c|   14 --
  qapi-schema.json |   12 
  qerror.c |4 
  qerror.h |3 +++
  qmp.c|   17 +
  8 files changed, 57 insertions(+), 24 deletions(-)





Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 10:41 -0600, David Ahern wrote:
 
 On 04/25/11 10:37, Alex Williamson wrote:
  On Mon, 2011-04-25 at 10:28 -0600, David Ahern wrote:
  Running qemu-kvm.git as of today (ffce28f, April 18, 2011) the virtual
  function passed to the VM is losing its assigned mac address. That is,
  prior to launching qemu-kvm, the following command is run to set the MAC
  address:
 
  ip link set dev eth2 vf 0 mac 02:12:34:56:79:20
 
  Yet, when the VM boots the MAC address is random which is what happens
  when the VF is reset. Looking through the commit logs between 0.13.0 --
  the version in Fedora 14 -- and latest git I found the following:
 
  commit d9488459ff2ab113293586c1c36b1679bb15deee
  Author: Alex Williamson alex.william...@redhat.com
  Date:   Thu Mar 17 15:24:31 2011 -0600
 
  device-assignment: Reset device on system reset
 
  On system reset, we currently try to quiesce DMA by clearing the
  command register.  This assumes that nothing re-enables bus master
  support without first de-programming the device.  Use a bigger
  hammer to help the guest not shoot itself by issuing a function
  reset via sysfs on each system reset.
 
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  Acked-by: Chris Wright chr...@redhat.com
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
  Is this the cause of the MAC address reset and is this behavior intended?
  
  Ugh, I hope not, it's certainly not an intended side effect.  Can you
  see if the problem still happens if you revert this patch?  If it does,
 
 I commented out the write() in the reset function and indeed the mac
 address was not reset on VM boot.

Ok, here's what I see on my system:

# modprobe igbvf
# dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
# modprobe -r igbvf
# echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
# modprobe igbvf
# dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c

So yes, it does change.  However, if I set the VF mac instead of using a
randomly generated one, I get:

# modprobe -r igbvf
# ip link set eth2 vf 6 mac 02:00:10:91:73:01
# modprobe igbvf
# dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
# modprobe -r igbvf
# echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
# modprobe igbvf
# dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
igbvf 0000:01:11.5: Address: 02:00:10:91:73:01

So now it sticks.  You're going to get random mac addresses on the VFs
every time you reload the igb driver (ie. ever boot) anyway (at least
with these sr-iov cards), so if you need consistent macs, they probably
need to be set before launching the VM anyway.  Thanks,

Alex



Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern
On 04/20/11 20:35, Alex Williamson wrote:
 Device assignment via a VF provides the lowest latency and most
 bandwidth for *getting data off the host system*, though virtio/vhost is
 getting better.  If all you care about is VM-VM on the same host or
 VM-host, then virtio is only limited by memory bandwidth/latency and
 host processor cycles.  Your processor has 25GB/s of memory bandwidth.
 On the other hand, the VF has to send data all the way out to the wire
 and all the way back up through the NIC to get to the other VM/host.
 You're using a 1Gb/s NIC.  Your results actually seem to indicate you're
 getting better than wire rate, so maybe you're only passing through an
 internal switch on the NIC, in any case, VFs are not optimal for
 communication within the same physical system.  They are optimal for off
 host communication.  Thanks,

Hi Alex:

Host-host was the next focus for the tests. I have 2 of the
aforementioned servers, each configured identically. As a reminder:

Host:
  Dell R410
  2 quad core E5620@2.40 GHz processors
  16 GB RAM
  Intel 82576 NIC (Gigabit ET Quad Port)
  - devices eth2, eth3, eth4, eth5
  Fedora 14
  kernel: 2.6.35.12-88.fc14.x86_64
  qemu-kvm.git, ffce28fe6 (18-April-11)

VMs:
  Fedora 14
  kernel 2.6.35.11-83.fc14.x86_64
  2 vcpus
  1GB RAM
  2 NICs - 1 virtio, 1 VF

The virtio network arguments to qemu-kvm are:
  -netdev type=tap,vhost=on,ifname=tap0,id=netdev1
  -device virtio-net-pci,mac=${mac},netdev=netdev1
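
(The VF NIC, by contrast, is passed in with qemu-kvm device assignment --
presumably something along the lines of -device pci-assign,host=0000:01:10.0,
where the host address here is made up for illustration.)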


For this round of tests I have the following setup:

  .==.
  | Host - A |
  |  |
  |  .-. |
  |  |  Virtual Machine - C| |
  |  | | |
  |  |  .--. .--.  | |
  |  '--| eth1 |-| eth0 |--' |
  | '--' '--'|
  |   192.168. | | 192.168.103.71
  | 102.71 |  .--.   |
  ||  | tap0 |   |
  ||  '--'   |
  || |   |
  ||  .--.   |
  ||  |  br  | 192.168.103.79
  ||  '--'   |
  |   {VF}   |   |
  |   ..  .--.   |
  '===|  eth2  |==| eth3 |==='
  ''  '--'
192.168.102.79 | |
   | point-to-   |
   |   point |
   | connections |
192.168.102.80 | |
  ..  .--.
  .===|  eth2  |==| eth3 |===.
  |   ''  '--'   |
  |   {VF}   |   |
  ||  .--.   |
  ||  |  br  | 192.168.103.80
  ||  '--'   |
  || |   |
  ||  .--.   |
  ||  | tap0 |   |
  |   192.168. |  '--'   |
  | 102.81 | | 192.168.103.81
  | .--. .--.|
  |  .--| eth1 |-| eth0 |--. |
  |  |  '--' '--'  | |
  |  | | |
  |  |  Virtual Machine - D| |
  |  '-' |
  |  |
  | Host - B |
  '=='


So, basically, 192.168.102 is the network where the VMs have a VF, and
192.168.103 is the network where the VMs use virtio for networking.

The netperf commands are all run on either Host-A or VM-C:

  netperf -H $ip -jcC -v 2 -t TCP_RR  -- -r 1024 -D L,R
  netperf -H $ip -jcC -v 2 -t TCP_STREAM  -- -m 1024 -D L,R


                 latency  throughput
                  (usec)        Mbps
cross-host:
  A-B, eth2          185         932
  A-B, eth3          185         935

same host, host-VM:
  A-C, using VF      488        1085 (seen as high as 1280's)
  A-C, virtio        150        4282

cross-host, host-VM:
  A-D, VF            489         938
  A-D, virtio        288         889

cross-host, VM-VM:
  C-D, VF            488         934
  C-D, virtio        490         933


While throughput for VFs is fine (near line-rate when crossing hosts),
the latency is horrible. Any options to improve that?

David


Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern


On 04/25/11 11:30, Alex Williamson wrote:
 So yes, it does change.  However, if I set the VF mac instead of using a
 randomly generated one, I get:
 
 # modprobe -r igbvf
 # ip link set eth2 vf 6 mac 02:00:10:91:73:01
 # modprobe igbvf
 # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
 igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
 igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 # modprobe -r igbvf
 # echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
 # modprobe igbvf
 # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
 igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
 igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 
 So now it sticks.  You're going to get random mac addresses on the VFs
 every time you reload the igb driver (ie. ever boot) anyway (at least
 with these sr-iov cards), so if you need consistent macs, they probably
 need to be set before launching the VM anyway.  Thanks,


You lost me on this. I do not have the igbvf driver loaded in the host,
only the guest. I am setting the MAC address for the VF in the host
before launching the VM. The host's igb driver gets loaded at boot only.

David


 
 Alex
 


Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern


On 04/21/11 02:07, Avi Kivity wrote:
 On 04/21/2011 05:35 AM, Alex Williamson wrote:
 Device assignment via a VF provides the lowest latency and most
 bandwidth for *getting data off the host system*, though virtio/vhost is
 getting better.  If all you care about is VM-VM on the same host or
 VM-host, then virtio is only limited by memory bandwidth/latency and
 host processor cycles.  Your processor has 25GB/s of memory bandwidth.
 On the other hand, the VF has to send data all the way out to the wire
 and all the way back up through the NIC to get to the other VM/host.
 You're using a 1Gb/s NIC.  Your results actually seem to indicate you're
 getting better than wire rate, so maybe you're only passing through an
 internal switch on the NIC, in any case, VFs are not optimal for
 communication within the same physical system.  They are optimal for off
 host communication.  Thanks,

 
 Note I think in both cases we can make significant improvements:
 - for VFs, steer device interrupts to the cpus which run the vcpus that
 will receive the interrupts eventually (ISTR some work about this, but
 not sure)

I don't understand your point here. I thought interrupts for the VF were
only delivered to the guest, not the host.

David

 - for virtio, use a DMA engine to copy data (I think there exists code
 in upstream which does this, but has this been enabled/tuned?)
 



Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern


On 04/21/11 07:09, Avi Kivity wrote:
 On 04/21/2011 03:31 PM, Stefan Hajnoczi wrote:
 On Thu, Apr 21, 2011 at 9:07 AM, Avi Kivitya...@redhat.com  wrote:
   Note I think in both cases we can make significant improvements:
   - for VFs, steer device interrupts to the cpus which run the vcpus
 that will
   receive the interrupts eventually (ISTR some work about this, but
 not sure)
   - for virtio, use a DMA engine to copy data (I think there exists
 code in
   upstream which does this, but has this been enabled/tuned?)

 Which data copy in virtio?  Is this a vhost-net specific thing you're
 thinking about?
 
 There are several copies.
 
 qemu's virtio-net implementation incurs a copy on tx and on rx when
 calling the kernel; in addition there is also an internal copy:
 
 /* copy in packet.  ugh */
 len = iov_from_buf(sg, elem.in_num,
                    buf + offset, size - offset);
 
 In principle vhost-net can avoid the tx copy, but I think now we have 1
 copy on rx and tx each.

So there is a copy internal to qemu, then from qemu to the host tap
device and then tap device to a physical NIC if the packet is leaving
the host?

Is that what the zero-copy patch set is attempting - bypassing the
transmit copy to the macvtap device?

 
 If a host interface is dedicated to backing a vhost-net interface (say
 if you have an SR/IOV card) then you can in principle avoid the rx copy
 as well.
 
 An alternative to avoiding the copies is to use a dma engine, like I
 mentioned.
 

How does the DMA engine differ from the zero-copy patch set?

David


Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern


On 04/25/11 11:30, Alex Williamson wrote:
 # modprobe -r igbvf
 # ip link set eth2 vf 6 mac 02:00:10:91:73:01
 # modprobe igbvf
 # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
 igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
 igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 # modprobe -r igbvf
 # echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
 # modprobe igbvf
 # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
 igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
 igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 
 So now it sticks.  You're going to get random mac addresses on the VFs
 every time you reload the igb driver (ie. ever boot) anyway (at least
 with these sr-iov cards), so if you need consistent macs, they probably
 need to be set before launching the VM anyway.  Thanks,
 
 Alex
 

Ok, I was able to repeat the above commands from the host command line.

However, when qemu-kvm starts the MAC is reset.

# ip link show | less

2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
qlen 1000
link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 02:12:34:56:80:20

-- that's the MAC address I set

I start qemu-kvm (unpatched version) and the host side sees the address
changed:

# ip link show | less

2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
qlen 1000
link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 7a:17:3f:98:0f:db


Can you try that aspect on your end - seeing if the MAC address
maintains after starting qemu-kvm?

David


Re: performance of virtual functions compared to virtio

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 11:39 -0600, David Ahern wrote:
 On 04/20/11 20:35, Alex Williamson wrote:
  Device assignment via a VF provides the lowest latency and most
  bandwidth for *getting data off the host system*, though virtio/vhost is
  getting better.  If all you care about is VM-VM on the same host or
  VM-host, then virtio is only limited by memory bandwidth/latency and
  host processor cycles.  Your processor has 25GB/s of memory bandwidth.
  On the other hand, the VF has to send data all the way out to the wire
  and all the way back up through the NIC to get to the other VM/host.
  You're using a 1Gb/s NIC.  Your results actually seem to indicate you're
  getting better than wire rate, so maybe you're only passing through an
  internal switch on the NIC, in any case, VFs are not optimal for
  communication within the same physical system.  They are optimal for off
  host communication.  Thanks,
 
 Hi Alex:
 
 Host-host was the next focus for the tests. I have 2 of the
 aforementioned servers, each configured identically. As a reminder:
 
 Host:
   Dell R410
   2 quad core E5620@2.40 GHz processors
   16 GB RAM
   Intel 82576 NIC (Gigabit ET Quad Port)
   - devices eth2, eth3, eth4, eth5
   Fedora 14
   kernel: 2.6.35.12-88.fc14.x86_64
   qemu-kvm.git, ffce28fe6 (18-April-11)
 
 VMs:
   Fedora 14
   kernel 2.6.35.11-83.fc14.x86_64
   2 vcpus
   1GB RAM
   2 NICs - 1 virtio, 1 VF
 
 The virtio network arguments to qemu-kvm are:
   -netdev type=tap,vhost=on,ifname=tap0,id=netdev1
   -device virtio-net-pci,mac=${mac},netdev=netdev1
 
 
 For this round of tests I have the following setup:
 
   .==.
   | Host - A |
   |  |
   |  .-. |
   |  |  Virtual Machine - C| |
   |  | | |
   |  |  .--. .--.  | |
   |  '--| eth1 |-| eth0 |--' |
   | '--' '--'|
   |   192.168. | | 192.168.103.71
   | 102.71 |  .--.   |
   ||  | tap0 |   |
   ||  '--'   |
   || |   |
   ||  .--.   |
   ||  |  br  | 192.168.103.79
   ||  '--'   |
   |   {VF}   |   |
   |   ..  .--.   |
   '===|  eth2  |==| eth3 |==='
   ''  '--'
 192.168.102.79 | |
| point-to-   |
|   point |
| connections |
 192.168.102.80 | |
   ..  .--.
   .===|  eth2  |==| eth3 |===.
   |   ''  '--'   |
   |   {VF}   |   |
   ||  .--.   |
   ||  |  br  | 192.168.103.80
   ||  '--'   |
   || |   |
   ||  .--.   |
   ||  | tap0 |   |
   |   192.168. |  '--'   |
   | 102.81 | | 192.168.103.81
   | .--. .--.|
   |  .--| eth1 |-| eth0 |--. |
   |  |  '--' '--'  | |
   |  | | |
   |  |  Virtual Machine - D| |
   |  '-' |
   |  |
   | Host - B |
   '=='
 
 
 So, basically, 192.168.102 is the network where the VMs have a VF, and
 192.168.103 is the network where the VMs use virtio for networking.
 
 The netperf commands are all run on either Host-A or VM-C:
 
   netperf -H $ip -jcC -v 2 -t TCP_RR  -- -r 1024 -D L,R
   netperf -H $ip -jcC -v 2 -t TCP_STREAM  -- -m 1024 -D L,R
 
 
                  latency  throughput
                   (usec)        Mbps
 cross-host:
   A-B, eth2          185         932
   A-B, eth3          185         935

This is actually PF-PF, right?  It would be interesting to load igbvf on
the hosts and determine VF-VF latency as well.

 same host, host-VM:
   A-C, using VF      488        1085 (seen as high as 1280's)
   A-C, virtio        150        4282

We know virtio has a shorter path for this test.

 cross-host, host-VM:
   A-D, VF            489         938
   A-D, virtio        288         889
 
 cross-host, VM-VM:
   C-D, VF            488         934
   C-D, virtio        490         933
 
 
 While throughput for VFs is fine (near line-rate when crossing hosts),

FWIW, it's not too difficult to get line rate on a 1Gbps network, even
some of the emulated NICs can do it.  There will be a difference in host
CPU power to get it though, where it should theoretically be emulated <
virtio < pci-assign.

 the latency is horrible. 

Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 12:04 -0600, David Ahern wrote:
 
 On 04/25/11 11:30, Alex Williamson wrote:
  # modprobe -r igbvf
  # ip link set eth2 vf 6 mac 02:00:10:91:73:01
  # modprobe igbvf
  # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
  igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
  igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
  igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
  # modprobe -r igbvf
  # echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
  # modprobe igbvf
  # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
  igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
  igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
  igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
  igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
  
  So now it sticks.  You're going to get random mac addresses on the VFs
  every time you reload the igb driver (ie. ever boot) anyway (at least
  with these sr-iov cards), so if you need consistent macs, they probably
  need to be set before launching the VM anyway.  Thanks,
  
  Alex
  
 
 Ok, I was able to repeat the above commands from the host command line.
 
 However, when qemu-kvm starts the MAC is reset.
 
 # ip link show | less
 
 2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
 qlen 1000
 link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
 vf 0 MAC 02:12:34:56:80:20
 
 -- that's the MAC address I set
 
 I start qemu-kvm (unpatched version) and the host side sees the address
 changed:
 
 # ip link show | less
 
 2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
 qlen 1000
 link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
 vf 0 MAC 7a:17:3f:98:0f:db
 
 
 Can you try that aspect on your end - seeing if the MAC address
 maintains after starting qemu-kvm?

I don't see this happening on my system, once manually set the mac never
changes.  I can restart and reset the VM and the host and guest both
continue seeing the set mac address.  I tested it with both a recent
rhel6.1 host kernel as well as upstream 2.6.39-rc4.  If I switch to a VF
with an unset mac, those will change on each VM reset or restart.
Thanks,

Alex



Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern
On 04/25/11 12:13, Alex Williamson wrote:
 So, basically, 192.168.102 is the network where the VMs have a VF, and
 192.168.103 is the network where the VMs use virtio for networking.

 The netperf commands are all run on either Host-A or VM-C:

   netperf -H $ip -jcC -v 2 -t TCP_RR  -- -r 1024 -D L,R
   netperf -H $ip -jcC -v 2 -t TCP_STREAM  -- -m 1024 -D L,R


                  latency  throughput
                   (usec)        Mbps
 cross-host:
   A-B, eth2          185         932
   A-B, eth3          185         935
 
 This is actually PF-PF, right?  It would be interesting to load igbvf on
 the hosts and determine VF-VF latency as well.

yes, PF-PF. eth3 has the added bridge layer, but from what I can see the
overhead is noise. I added host-to-host to put the host-to-VM numbers in
perspective.

 
 same host, host-VM:
   A-C, using VF      488        1085 (seen as high as 1280's)
   A-C, virtio        150        4282
 
 We know virtio has a shorter path for this test.

No complaints about the throughput numbers; the latency is the problem.

 
 cross-host, host-VM:
   A-D, VF            489         938
   A-D, virtio        288         889

 cross-host, VM-VM:
   C-D, VF            488         934
   C-D, virtio        490         933


 While throughput for VFs is fine (near line-rate when crossing hosts),
 
 FWIW, it's not too difficult to get line rate on a 1Gbps network, even
 some of the emulated NICs can do it.  There will be a difference in host
  CPU power to get it though, where it should theoretically be emulated <
  virtio < pci-assign.

10GB is the goal; 1GB offers a cheaper learning environment. ;-)

 
 the latency is horrible. Any options to improve that?
 
 If you don't mind testing, I'd like to see VF-VF between the hosts (to
 do this, don't assign eth2 an IP, just make sure it's up, then load the
 igbvf driver on the host and assign an IP to one of the VFs associated
 with the eth2 PF), and cross host testing using the PF for the guest
 instead of the VF.  This should help narrow down how much of the latency
 is due to using the VF vs the PF, since all of the virtio tests are
 using the PF.  I've been suspicious that the VF adds some latency, but
 haven't had a good test setup (or time) to dig very deep into it.

It's a quad nic, so I left eth2 and eth3 alone and added the VF-VF test
using VFs on eth4.

Indeed latency is 488 usec and throughput is 925 Mbps. This is
host-to-host using VFs.

David

 Thanks,
 
 Alex
 


Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern


On 04/25/11 12:36, Alex Williamson wrote:
 On Mon, 2011-04-25 at 12:04 -0600, David Ahern wrote:

 On 04/25/11 11:30, Alex Williamson wrote:
 # modprobe -r igbvf
 # ip link set eth2 vf 6 mac 02:00:10:91:73:01
 # modprobe igbvf
 # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
 igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
 igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 # modprobe -r igbvf
 # echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
 # modprobe igbvf
 # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
 igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
 igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 igbvf 0000:01:11.5: Address: 02:00:10:91:73:01

 So now it sticks.  You're going to get random mac addresses on the VFs
 every time you reload the igb driver (ie. ever boot) anyway (at least
 with these sr-iov cards), so if you need consistent macs, they probably
 need to be set before launching the VM anyway.  Thanks,

 Alex


 Ok, I was able to repeat the above commands from the host command line.

 However, when qemu-kvm starts the MAC is reset.

 # ip link show | less

 2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
 qlen 1000
 link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
 vf 0 MAC 02:12:34:56:80:20

 -- that's the MAC address I set

 I start qemu-kvm (unpatched version) and the host side sees the address
 changed:

 # ip link show | less

 2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
 qlen 1000
 link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
 vf 0 MAC 7a:17:3f:98:0f:db


 Can you try that aspect on your end - seeing if the MAC address
 maintains after starting qemu-kvm?
 
 I don't see this happening on my system, once manually set the mac never
 changes.  I can restart and reset the VM and the host and guest both
 continue seeing the set mac address.  I tested it with both a recent
 rhel6.1 host kernel as well as upstream 2.6.39-rc4.  If I switch to a VF
 with an unset mac, those will change on each VM reset or restart.

Blacklist igbvf in the host and you will. That must be the difference: I
was preventing the vf driver from loading in the host -- it's not needed
there, so why load it?

I rebooted for a fresh run. Loaded the igbvf driver before starting the
VM using my tools. With the igbvf driver loaded in the host the MAC
address for the VF was not reset.

As for why I blacklisted it -- udev. What a PITA with VFs. I saw the
feature for Fedora 15 which should address this.

David


Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 13:12 -0600, David Ahern wrote:
 
 On 04/25/11 12:36, Alex Williamson wrote:
  On Mon, 2011-04-25 at 12:04 -0600, David Ahern wrote:
 
  On 04/25/11 11:30, Alex Williamson wrote:
  # modprobe -r igbvf
  # ip link set eth2 vf 6 mac 02:00:10:91:73:01
  # modprobe igbvf
  # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
  igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
  igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
  igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
  # modprobe -r igbvf
  # echo 1 > /sys/bus/pci/devices/0000:01:11.5/reset
  # modprobe igbvf
  # dmesg | grep "igbvf 0000\:01\:11.5\: Address\:"
  igbvf 0000:01:11.5: Address: d2:c8:17:d6:97:f7
  igbvf 0000:01:11.5: Address: 4e:ee:2a:d8:12:7c
  igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
  igbvf 0000:01:11.5: Address: 02:00:10:91:73:01
 
  So now it sticks.  You're going to get random mac addresses on the VFs
  every time you reload the igb driver (ie. ever boot) anyway (at least
  with these sr-iov cards), so if you need consistent macs, they probably
  need to be set before launching the VM anyway.  Thanks,
 
  Alex
 
 
  Ok, I was able to repeat the above commands from the host command line.
 
  However, when qemu-kvm starts the MAC is reset.
 
  # ip link show | less
 
  2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
  qlen 1000
  link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
  vf 0 MAC 02:12:34:56:80:20
 
  -- that's the MAC address I set
 
  I start qemu-kvm (unpatched version) and the host side sees the address
  changed:
 
  # ip link show | less
 
  2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
  qlen 1000
  link/ether 00:1b:21:98:b7:10 brd ff:ff:ff:ff:ff:ff
  vf 0 MAC 7a:17:3f:98:0f:db
 
 
  Can you try that aspect on your end - seeing if the MAC address
  maintains after starting qemu-kvm?
  
  I don't see this happening on my system, once manually set the mac never
  changes.  I can restart and reset the VM and the host and guest both
  continue seeing the set mac address.  I tested it with both a recent
  rhel6.1 host kernel as well as upstream 2.6.39-rc4.  If I switch to a VF
  with an unset mac, those will change on each VM reset or restart.
 
 Blacklist igbvf in the host and you will. That must be the difference: I
 was preventing the vf driver from loading in the host -- it's not needed
 there, so why load it?

I already have it blacklisted.  It's not needed if you're using the VFs
they way we are, but there are other uses.

 I rebooted for a fresh run. Loaded the igbvf driver before starting the
 VM using my tools. With the igbvf driver loaded in the host the MAC
 address for the VF was not reset.
 
 As for why I blacklisted it -- udev. What a PITA with VFs. I saw the
 feature for Fedora 15 which should address this.

Yes, my VM is up to renaming the VFs eth1340 since the mac changes every
boot.  I'm still confused though as I did a whole round of testing after
a reboot where igbvf was never loaded and the set mac address stuck
across VM restarts and resets.

Alex



Re: performance of virtual functions compared to virtio

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 13:07 -0600, David Ahern wrote:
 On 04/25/11 12:13, Alex Williamson wrote:
  So, basically, 192.168.102 is the network where the VMs have a VF, and
  192.168.103 is the network where the VMs use virtio for networking.
 
  The netperf commands are all run on either Host-A or VM-C:
 
netperf -H $ip -jcC -v 2 -t TCP_RR  -- -r 1024 -D L,R
netperf -H $ip -jcC -v 2 -t TCP_STREAM  -- -m 1024 -D L,R
 
 
                   latency  throughput
                    (usec)        Mbps
  cross-host:
    A-B, eth2          185         932
    A-B, eth3          185         935
  
  This is actually PF-PF, right?  It would be interesting to load igbvf on
  the hosts and determine VF-VF latency as well.
 
 yes, PF-PF. eth3 has the added bridge layer, but from what I can see the
 overhead is noise. I added host-to-host to put the host-to-VM numbers in
 perspective.
 
  
  same host, host-VM:
    A-C, using VF      488        1085 (seen as high as 1280's)
    A-C, virtio        150        4282
  
  We know virtio has a shorter path for this test.
 
 No complaints about the throughput numbers; the latency is the problem.
 
  
  cross-host, host-VM:
    A-D, VF            489         938
    A-D, virtio        288         889
 
  cross-host, VM-VM:
    C-D, VF            488         934
    C-D, virtio        490         933
 
 
  While throughput for VFs is fine (near line-rate when crossing hosts),
  
  FWIW, it's not too difficult to get line rate on a 1Gbps network, even
  some of the emulated NICs can do it.  There will be a difference in host
  CPU power to get it though, where it should theoretically be emulated <
  virtio < pci-assign.
 
 10GB is the goal; 1GB offers a cheaper learning environment. ;-)
 
  
  the latency is horrible. Any options to improve that?
  
  If you don't mind testing, I'd like to see VF-VF between the hosts (to
  do this, don't assign eth2 an IP, just make sure it's up, then load the
  igbvf driver on the host and assign an IP to one of the VFs associated
  with the eth2 PF), and cross host testing using the PF for the guest
  instead of the VF.  This should help narrow down how much of the latency
  is due to using the VF vs the PF, since all of the virtio tests are
  using the PF.  I've been suspicious that the VF adds some latency, but
  haven't had a good test setup (or time) to dig very deep into it.
 
 It's a quad nic, so I left eth2 and eth3 alone and added the VF-VF test
 using VFs on eth4.
 
 Indeed latency is 488 usec and throughput is 925 Mbps. This is
 host-to-host using VFs.

So we're effectively getting host-host latency/throughput for the VF,
it's just that in the 82576 implementation of SR-IOV, the VF takes a
latency hit that puts it pretty close to virtio.  Unfortunate.  I think
you'll find that passing the PF to the guests should be pretty close to
that 185us latency.  I would assume (hope) the higher end NICs reduce
this, but it seems to be a hardware limitation, so it's hard to predict.
Thanks,

Alex



Connecting to a new guest VM

2011-04-25 Thread Emmanuel Noobadmin
Resending because the first did not appear to go through

This is probably a very noob question but I haven't been able to find
a solution that worked so far. Maybe it's just something really minor
that I've missed so I'll appreciate some pointers.

Running on Scientific Linux 6, bridged networking configured with
ifcfg-br0 and ifcfg-eth0, networking is working, I can ssh/vnc into
the host.

I created a guest using the following command as root following the
virt-install man page.

virt-install -n vm_01 -r 640 --vcpus=1
--file=/home/VMs/vm110401/vm_01_d1 -s 170 --nonsparse
--network=bridge:br0  --accelerate
--cdrom=/home/ISO/CentOS-5.6-x86_64-bin-DVD-1of2.iso --os-type=linux
--os-variant=rhel5

It seems to work, except I get a line that says "Escape Char is ^]"
And the console doesn't react to any further input except to exit.
Then it warns me that the OS is still being installed.

Being a noob, I figured maybe a GUI will be easier. So I installed X
desktop and created another VM with the same parameters except I added
--vnc --vncport=15901

However, I cannot connect to the VM, whether using the public IP or
through the LAN IP.

I have the vnc port allowed in iptables, the port is not the default
5901 since I already have the external VNC listening on that port.

I've also tried to connect to the VM via 127.0.0.1 through my VNC
session but depending on what I try (public, LAN, vnc from within vnc
to localhost) I get either a "connection refused" or a "write: broken
pipe" error.

Based on some google searches, I've also edited qemu.conf to include
the line vnc_listen = "0.0.0.0"

But still no joy and, from googling, apparently I'm not the only noob
who finds himself stuck. So I'd appreciate it greatly if somebody could
point out what's missing or wrong, thanks!
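
One way to check where the guest's VNC server actually ended up
listening (a sketch reusing the vm_01 name and port 15901 from above;
virsh and netstat assumed available):

  virsh vncdisplay vm_01       # prints the display, e.g. :10001 => port 15901
  netstat -tlnp | grep 15901   # confirm qemu is bound to 0.0.0.0, not 127.0.0.1

If it shows 127.0.0.1:15901, the vnc_listen change did not take effect
for that guest (libvirtd may need a restart, and the guest a fresh start).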


Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern


On 04/25/11 13:29, Alex Williamson wrote:
 So we're effectively getting host-host latency/throughput for the VF,
 it's just that in the 82576 implementation of SR-IOV, the VF takes a
 latency hit that puts it pretty close to virtio.  Unfortunate.  I think

For host-to-VM, using VFs is worse than virtio, which is counterintuitive.

 you'll find that passing the PF to the guests should be pretty close to
 that 185us latency.  I would assume (hope) the higher end NICs reduce

About that 185usec: do you know where the bottleneck is? It seems as if
the packet is held in some queue waiting for an event/timeout before it
is transmitted.

David


 this, but it seems to be a hardware limitation, so it's hard to predict.
 Thanks,
 
 Alex
 


Re: performance of virtual functions compared to virtio

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:
 
 On 04/25/11 13:29, Alex Williamson wrote:
  So we're effectively getting host-host latency/throughput for the VF,
  it's just that in the 82576 implementation of SR-IOV, the VF takes a
  latency hit that puts it pretty close to virtio.  Unfortunate.  I think
 
 For host-to-VM using VFs is worse than virtio which is counterintuitive.

On the same host, just think about the data path of one versus the
other.  On the guest side, there's virtio vs a physical NIC.  virtio is
designed to be virtualization friendly, so hopefully has less context
switches in setting up and processing transactions.  Once the packet
leaves the assigned physical NIC, it has to come back up the entire host
I/O stack, while the virtio device is connected to an internal bridge
and bypasses all but the upper level network routing.

  you'll find that passing the PF to the guests should be pretty close to
  that 185us latency.  I would assume (hope) the higher end NICs reduce
 
 About that 185usec: do you know where the bottleneck is? It seems as if
 the packet is held in some queue waiting for an event/timeout before it
 is transmitted.

I don't know specifically, I don't do much network performance tuning.
Interrupt coalescing could be a factor, along with various offload
settings, and of course latency of the physical NIC hardware and
interconnects.  Thanks,

Alex



Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern


On 04/25/11 13:18, Alex Williamson wrote:
 I don't see this happening on my system, once manually set the mac never
 changes.  I can restart and reset the VM and the host and guest both
 continue seeing the set mac address.  I tested it with both a recent
 rhel6.1 host kernel as well as upstream 2.6.39-rc4.  If I switch to a VF
 with an unset mac, those will change on each VM reset or restart.

 Blacklist igbvf in the host and you will. That must be the difference: I
 was preventing the vf driver from loading in the host -- it's not needed
 there, so why load it?
 
 I already have it blacklisted.  It's not needed if you're using the VFs
 they way we are, but there are other uses.
 
 I rebooted for a fresh run. Loaded the igbvf driver before starting the
 VM using my tools. With the igbvf driver loaded in the host the MAC
 address for the VF was not reset.

 As for why I blacklisted it -- udev. What a PITA with VFs. I saw the
 feature for Fedora 15 which should address this.
 
 Yes, my VM is up to renaming the VFs eth1340 since the mac changes every
 boot.  I'm still confused though as I did a whole round of testing after
 a reboot where igbvf was never loaded and the set mac address stuck
 across VM restarts and resets.
 
 Alex
 

The resetting of the VM MAC address is fixed in 2.6.39-rc4, so it's a
Fedora 14, 2.6.35.12 problem.

David


Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern


On 04/25/11 14:27, Alex Williamson wrote:
 On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:

 On 04/25/11 13:29, Alex Williamson wrote:
 So we're effectively getting host-host latency/throughput for the VF,
 it's just that in the 82576 implementation of SR-IOV, the VF takes a
 latency hit that puts it pretty close to virtio.  Unfortunate.  I think

 For host-to-VM using VFs is worse than virtio which is counterintuitive.
 
 On the same host, just think about the data path of one versus the
 other.  On the guest side, there's virtio vs a physical NIC.  virtio is
 designed to be virtualization friendly, so hopefully has less context
 switches in setting up and processing transactions.  Once the packet
 leaves the assigned physical NIC, it has to come back up the entire host
 I/O stack, while the virtio device is connected to an internal bridge
 and bypasses all but the upper level network routing.

I get the virtio path, but you lost me on the physical NIC. I thought
the point of VFs is to keep the host from having to touch the packet,
so the processing path for a VM using a VF would be the same as for a non-VM.

David


 
 you'll find that passing the PF to the guests should be pretty close to
 that 185us latency.  I would assume (hope) the higher end NICs reduce

 About that 185usec: do you know where the bottleneck is? It seems as if
 the packet is held in some queue waiting for an event/timeout before it
 is transmitted.
 
 I don't know specifically, I don't do much network performance tuning.
 Interrupt coalescing could be a factor, along with various offload
 settings, and of course latency of the physical NIC hardware and
 interconnects.  Thanks,
 
 Alex
 


Re: performance of virtual functions compared to virtio

2011-04-25 Thread Andrew Theurer
On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:
 
 On 04/25/11 13:29, Alex Williamson wrote:
  So we're effectively getting host-host latency/throughput for the VF,
  it's just that in the 82576 implementation of SR-IOV, the VF takes a
  latency hit that puts it pretty close to virtio.  Unfortunate.  I think
 
 For host-to-VM using VFs is worse than virtio which is counterintuitive.
 
  you'll find that passing the PF to the guests should be pretty close to
  that 185us latency.  I would assume (hope) the higher end NICs reduce
 
 About that 185usec: do you know where the bottleneck is? It seems as if
 the packet is held in some queue waiting for an event/timeout before it
 is transmitted.

You might want to check the VF driver.  I know some versions of the
ixgbevf driver have a throttled-interrupt option which will increase
latency with some settings.  I don't remember if the igbvf driver has
the same feature.  If it does, you will want to turn this option off
for best latency.
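
For concreteness, a hedged sketch of turning such throttling off -- the
InterruptThrottleRate module option comes from Intel's out-of-tree
drivers and may not exist in a given in-kernel igbvf/ixgbevf build, so
verify against your driver before relying on it:

  modprobe ixgbevf InterruptThrottleRate=0   # out-of-tree driver knob
  ethtool -C eth4 rx-usecs 0                 # generic coalescing control

(eth4 stands in for whichever VF netdev is under test.)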

 
 David
 
 
  this, but it seems to be a hardware limitation, so it's hard to predict.
  Thanks,
  
  Alex

-Andrew



Re: performance of virtual functions compared to virtio

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 14:40 -0600, David Ahern wrote:
 
 On 04/25/11 14:27, Alex Williamson wrote:
  On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:
 
  On 04/25/11 13:29, Alex Williamson wrote:
  So we're effectively getting host-host latency/throughput for the VF,
  it's just that in the 82576 implementation of SR-IOV, the VF takes a
  latency hit that puts it pretty close to virtio.  Unfortunate.  I think
 
  For host-to-VM using VFs is worse than virtio which is counterintuitive.
  
  On the same host, just think about the data path of one versus the
  other.  On the guest side, there's virtio vs a physical NIC.  virtio is
  designed to be virtualization friendly, so hopefully has less context
  switches in setting up and processing transactions.  Once the packet
  leaves the assigned physical NIC, it has to come back up the entire host
  I/O stack, while the virtio device is connected to an internal bridge
  and bypasses all but the upper level network routing.
 
 I get the virtio path, but you lost me on the physical NIC. I thought
 the point of VFs is to bypass the host from having to touch the packet,
 so the processing path with a VM using a VF would be the same as a non-VM.

In the VF case, the host is only involved in processing the packet on
its end of the connection, but the packet still has to go all the way
out to the physical device and all the way back.  It is handled on one
end by the VM and on the other end by the host.

An analogy might be sending a letter to an office coworker in a
neighboring cube.  You could just pass the letter over the wall (virtio)
or you could go put it in the mailbox, signal the mail carrier, who
comes and moves it to your neighbor's mailbox, who then gets signaled
that they have a letter (device assignment).

Since the network stacks are completely separate from one another,
there's very little difference in data path whether you're talking to
the host, a remote system, or a remote VM, which is reflected in your
performance data.  Hope that helps,

Alex



Re: performance of virtual functions compared to virtio

2011-04-25 Thread David Ahern


On 04/25/11 15:02, Alex Williamson wrote:
 On Mon, 2011-04-25 at 14:40 -0600, David Ahern wrote:

 On 04/25/11 14:27, Alex Williamson wrote:
 On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:

 On 04/25/11 13:29, Alex Williamson wrote:
 So we're effectively getting host-host latency/throughput for the VF,
 it's just that in the 82576 implementation of SR-IOV, the VF takes a
 latency hit that puts it pretty close to virtio.  Unfortunate.  I think

 For host-to-VM using VFs is worse than virtio which is counterintuitive.

 On the same host, just think about the data path of one versus the
 other.  On the guest side, there's virtio vs a physical NIC.  virtio is
 designed to be virtualization friendly, so hopefully has less context
 switches in setting up and processing transactions.  Once the packet
 leaves the assigned physical NIC, it has to come back up the entire host
 I/O stack, while the virtio device is connected to an internal bridge
 and bypasses all but the upper level network routing.

 I get the virtio path, but you lost me on the physical NIC. I thought
 the point of VFs is to bypass the host from having to touch the packet,
 so the processing path with a VM using a VF would be the same as a non-VM.
 
 In the VF case, the host is only involved in processing the packet on
 it's end of the connection, but the packet still has to go all the way
 out to the physical device and all the way back.  Handled on one end by
 the VM and the other end by the host.
 
 An analogy might be sending a letter to an office coworker in a
 neighboring cube.  You could just pass the letter over the wall (virtio)
 or you could go put it in the mailbox, signal the mail carrier, who
 comes and moves it to your neighbor's mailbox, who then gets signaled
 that they have a letter (device assignment).
 
 Since the networks stacks are completely separate from one another,
 there's very little difference in data path whether you're talking to
 the host, a remote system, or a remote VM, which is reflected in your
 performance data.  Hope that helps,

Got you. I was thinking of host-VM as the VM being on a separate host; I
didn't make that clear. Thanks for clarifying - I like the letter example.

David

 
 Alex
 


Re: pci passthrough - VF reset at boot is dropping assigned MAC

2011-04-25 Thread David Ahern


On 04/25/11 14:29, David Ahern wrote:
 
 
 On 04/25/11 13:18, Alex Williamson wrote:
 I don't see this happening on my system, once manually set the mac never
 changes.  I can restart and reset the VM and the host and guest both
 continue seeing the set mac address.  I tested it with both a recent
 rhel6.1 host kernel as well as upstream 2.6.39-rc4.  If I switch to a VF
 with an unset mac, those will change on each VM reset or restart.

 Blacklist igbvf in the host and you will. That must be the difference: I
 was preventing the vf driver from loading in the host -- it's not needed
 there, so why load it?

 I already have it blacklisted.  It's not needed if you're using the VFs
 they way we are, but there are other uses.

 I rebooted for a fresh run. Loaded the igbvf driver before starting the
 VM using my tools. With the igbvf driver loaded in the host the MAC
 address for the VF was not reset.

 As for why I blacklisted it -- udev. What a PITA with VFs. I saw the
 feature for Fedora 15 which should address this.

 Yes, my VM is up to renaming the VFs eth1340 since the mac changes every
 boot.  I'm still confused though as I did a whole round of testing after
 a reboot where igbvf was never loaded and the set mac address stuck
 across VM restarts and resets.

 Alex

 
 The resetting of the VM MAC address is fixed in 2.6.39-rc4, so it's a
 Fedora 14, 2.6.35.12 problem.

Just to finish this off. This is the patch that fixed the MAC address
reset problem:

commit a6b5ea353845b3f3d9ac4317c0b3be9cc37c259b
Author: Greg Rose gregory.v.r...@intel.com
Date:   Sat Nov 6 05:42:59 2010 +

igb: Warn on attempt to override administratively set MAC/VLAN

Print a warning message to the system log when the VF attempts to
override administratively set MAC/VLAN configuration.

Signed-off-by: Greg Rose gregory.v.r...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com

David


Re: performance of virtual functions compared to virtio

2011-04-25 Thread Alex Williamson
On Mon, 2011-04-25 at 15:14 -0600, David Ahern wrote:
 
 On 04/25/11 15:02, Alex Williamson wrote:
  On Mon, 2011-04-25 at 14:40 -0600, David Ahern wrote:
 
  On 04/25/11 14:27, Alex Williamson wrote:
  On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:
 
  On 04/25/11 13:29, Alex Williamson wrote:
  So we're effectively getting host-host latency/throughput for the VF,
  it's just that in the 82576 implementation of SR-IOV, the VF takes a
  latency hit that puts it pretty close to virtio.  Unfortunate.  I think
 
  For host-to-VM using VFs is worse than virtio which is counterintuitive.
 
  On the same host, just think about the data path of one versus the
  other.  On the guest side, there's virtio vs a physical NIC.  virtio is
  designed to be virtualization friendly, so hopefully has less context
  switches in setting up and processing transactions.  Once the packet
  leaves the assigned physical NIC, it has to come back up the entire host
  I/O stack, while the virtio device is connected to an internal bridge
  and bypasses all but the upper level network routing.
 
  I get the virtio path, but you lost me on the physical NIC. I thought
  the point of VFs is to bypass the host from having to touch the packet,
  so the processing path with a VM using a VF would be the same as a non-VM.
  
  In the VF case, the host is only involved in processing the packet on
  it's end of the connection, but the packet still has to go all the way
  out to the physical device and all the way back.  Handled on one end by
  the VM and the other end by the host.
  
  An analogy might be sending a letter to an office coworker in a
  neighboring cube.  You could just pass the letter over the wall (virtio)
  or you could go put it in the mailbox, signal the mail carrier, who
  comes and moves it to your neighbor's mailbox, who then gets signaled
  that they have a letter (device assignment).
  
  Since the networks stacks are completely separate from one another,
  there's very little difference in data path whether you're talking to
  the host, a remote system, or a remote VM, which is reflected in your
  performance data.  Hope that helps,
 
 Got you. I was thinking host-VM as VM on separate host; I didn't make
 that clear. Thanks for clarifying - I like the letter example.

I should probably also note that being able to pass a letter over the
wall is possible because of the bridge/tap setup used for that
communication path, so it's available to emulated NICs as well.  virtio
is just a paravirtualization layer that makes it lower overhead than
emulation.  To get a letter out of the office (ie. off host), all
paths still eventually need to put the letter in the mailbox.  Thanks,

Alex




Re: A Live Backup feature for KVM

2011-04-25 Thread Jagane Sundar

On 4/25/2011 6:34 AM, Stefan Hajnoczi wrote:

On Mon, Apr 25, 2011 at 9:16 AM, Jagane Sundar jag...@sundar.org wrote:

The direction that I chose to go is slightly different. In both of the
proposals you pointed me at, the original virtual disk is made
read-only and the VM writes to a different COW file. After backup
of the original virtual disk file is complete, the COW file is merged
with the original vdisk file.

Instead, I create an Original-Blocks-COW-file to store the original
blocks that are overwritten by the VM everytime the VM performs
a write while the backup is in progress. Livebackup copies these
underlying blocks from the original virtual disk file before the VM's
write to the original virtual disk file is scheduled. The advantage of
this is that there is no merge necessary at the end of the backup, we
can simply delete the Original-Blocks-COW-file.

The advantage of the approach that redirects writes to a new file
instead is that the heavy work of copying data is done asynchronously
during the merge operation instead of in the write path, which would
impact guest performance.

Here's what I understand:

1. User takes a snapshot of the disk, QEMU creates old-disk.img backed
by the current-disk.img.
2. Guest issues a write A.
3. QEMU reads B from current-disk.img.
4. QEMU writes B to old-disk.img.
5. QEMU writes A to current-disk.img.
6. Guest receives write completion A.

The tricky thing is what happens if there is a failure after Step 5.
If writes A and B were unstable writes (no fsync()) then no ordering
is guaranteed and perhaps write A reached current-disk.img but write B
did not reach old-disk.img.  In this case we no longer have a
consistent old-disk.img snapshot - we're left with an updated
current-disk.img and old-disk.img does not have a copy of the old
data.


In both approaches the number of I/O operations remains constant:

WRITES_TO_NEW_FILE_APPROACH
Create snapshot
- As new writes from the VM come in:
1. Write to new-disk.img
Asynchronously:
a. Read from new-disk.img
b. Write into old-disk.img
Delete snapshot

WRITES_TO_CURRENT_FILE_APPROACH
Create snapshot
- As new writes from the VM come in:
1. Read old block from current-disk.img
2. Write old block to old-disk.img
3. Write new block to current-disk.img
Delete snapshot

The number of I/O operations is 2 writes and 1 read, in both cases.
The critical factor, then, is the duration for which the VM must
maintain the snapshot.
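
To make the comparison concrete, here is a self-contained sketch of the
WRITES_TO_CURRENT_FILE_APPROACH write path (all names and the in-memory
layout are hypothetical illustrations, not actual qemu code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct vdisk {
    bool backup_active;           /* a livebackup client is connected      */
    unsigned char *current_disk;  /* stands in for current-disk.img        */
    unsigned char *cow_file;      /* the Original-Blocks-COW-file          */
    bool *in_cow;                 /* per-block: old data already saved     */
    bool *dirty;                  /* per-block: modified since last backup */
};

static void guest_write(struct vdisk *d, uint64_t blk, const void *data)
{
    if (d->backup_active && !d->in_cow[blk]) {
        /* steps 1+2: copy the original block aside before overwriting it */
        memcpy(d->cow_file + blk * BLOCK_SIZE,
               d->current_disk + blk * BLOCK_SIZE, BLOCK_SIZE);
        d->in_cow[blk] = true;
    }
    /* step 3: the guest write lands in current-disk.img as usual */
    memcpy(d->current_disk + blk * BLOCK_SIZE, data, BLOCK_SIZE);
    d->dirty[blk] = true;   /* feeds the next incremental backup */
}

The dirty[] bitmap is what survives between backups in the proposal
above; the COW copies exist only while a backup client is connected.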


The solution is to fsync() after Step 4 and before Step 5, but this
will hurt performance.  We now have an extra read, write, and fsync()
on every write.

I agree - fsync() just defeats the whole purpose of building a
super-efficient live backup mechanism. I'm not planning to introduce fsync()s.
However, I want to treat the snapshot as a limited snapshot, only for backup
purposes. In my proposal, the old-disk.img is valid only for the time when
the livebackup client connects to qemu and transfers the blocks for
that backup over. If the disk suffers an intermittent failure after (5),
then the snapshot is deemed inconsistent, and discarded.


I have some reasons to believe that the Original-Blocks-COW-file
design that I am putting forth might work better. I have listed them
below. (It's past midnight here, so pardon me if it sounds garbled -- I
will try to clarify more in a writeup on wiki.qemu.org).
Let me know what your thoughts are..

I feel that the livebackup mechanism will impact the running VM
less. For example, if something goes wrong with the backup process,
then we can simply delete the Original-Blocks-COW-file and force
the backup client to do a full backup the next time around. The
running VM or its virtual disks are not impacted at all.

Abandoning snapshots is not okay.  Snapshots will be used in scenarios
beyond backup and I don't think we can make them
unreliable/throw-away.

My proposal is to treat the snapshot as a livebackup-specific entity
that exists only for the duration of the livebackup_client's connection
to qemu to transfer the blocks over. At other times, there is no
snapshot, just a dirty-blocks bitmap indicating which blocks were
modified since the last backup was taken.

Consider the use case of daily incremental backups:

WRITES_TO_NEW_FILE_APPROACH
- 1:00 AM Create snapshot A
24 hours go by. All writes by the VM
during this time are stored in the new-disk.img file.
- 1 AM next day, the backup program starts copying its
  incremental backup blocks, i.e. the blocks that were modified
  in the last 24 hours, and are all stored in new-disk.img
- 1:15 AM Merge snapshot A
 The asynchronous process now kicks in, and starts merging
the blocks from new-disk.img into the old-disk.img
- 1:15 AM Create snapshot B

WRITES_TO_CURRENT_FILE_APPROACH
- 1:00 AM livebackup_client connects to qemu and creates snapshot
- livebackup_client starts transferring blocks modified by VM
  in the last 24 hours over the network to the backup server.
  Let's say that this takes about 15 minutes.
- 

Re: [PATCH 1/1 v2] KVM: MMU: Optimize guest page table walk

2011-04-25 Thread Takuya Yoshikawa
On Mon, 25 Apr 2011 11:15:20 +0200
Jan Kiszka jan.kis...@web.de wrote:

  Sorry, I did not test on x86_32.
  
  Introducing a wrapper function with ifdef would be the best way?
  
 
 Maybe you could also add the missing 64-bit get_user for x86-32. Given
 that we have a corresponding put_user, I wonder why the get_user was
 left out.
 
 Jan
 

A Google search turned up a similar discussion on LKML in 2004.

In that thread, Linus explained how to tackle the 64-bit get_user
implementation, but I could not see what happened after that.
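
A rough sketch of the ifdef wrapper idea (illustrative only, not a
tested patch; the fallback simply reuses copy_from_user() where x86-32
lacks an 8-byte get_user()):

static int FNAME(read_guest_pte)(pt_element_t *pte,
				 pt_element_t __user *ptep_user)
{
#if defined(CONFIG_X86_32) && PTTYPE == 64
	/* x86-32 has no 64-bit get_user(); use the slower copy */
	return copy_from_user(pte, ptep_user, sizeof(*pte)) ? -EFAULT : 0;
#else
	return get_user(*pte, ptep_user);
#endif
}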

Takuya


[PATCH 1/2] virtio_balloon: disable oom killer when fill balloon

2011-04-25 Thread Dave Young
When memory pressure is high, virtio ballooning will probably cause OOM
killing.  Even if alloc_page() with GFP_NORETRY does not directly
trigger the OOM killer itself, it drives free memory low, and memory
allocations by other processes will then trigger OOM killing.  This is
not the desired behaviour.

Here we disable the OOM killer in fill_balloon() to address this issue.

Signed-off-by: Dave Young hidave.darks...@gmail.com
---
 drivers/virtio/virtio_balloon.c |3 +++
 1 file changed, 3 insertions(+)

--- linux-2.6.orig/drivers/virtio/virtio_balloon.c	2010-10-13 10:14:38.0 +0800
+++ linux-2.6/drivers/virtio/virtio_balloon.c	2011-04-26 11:38:43.979785141 +0800
@@ -25,6 +25,7 @@
 #include <linux/freezer.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/oom.h>
 
 struct virtio_balloon
 {
@@ -102,6 +103,7 @@ static void fill_balloon(struct virtio_b
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
+	oom_killer_disable();
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
 		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
 				__GFP_NOMEMALLOC | __GFP_NOWARN);
@@ -119,6 +121,7 @@ static void fill_balloon(struct virtio_b
 		vb->num_pages++;
 		list_add(&page->lru, &vb->pages);
 	}
+	oom_killer_enable();
 
 	/* Didn't get any?  Oh well. */
 	if (vb->num_pfns == 0)