Re: [PATCH -tip 4/6 V4] x86: kprobes checks safeness of insertion address.

2009-04-02 Thread Ananth N Mavinakayanahalli
On Thu, Apr 02, 2009 at 01:24:57PM -0400, Masami Hiramatsu wrote:
 
> +/* Recover original instruction */
> +static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
> +{
> + struct kprobe *kp;
> + kp = get_kprobe((void *)addr);
> + if (!kp)
> + return -EINVAL;
> +
> + /* Don't use p->ainsn.insn; which will be modified by fix_riprel */
> + memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
> + buf[0] = kp->opcode;
> + return 0;
> +}
> +
> +/* Dummy buffers for lookup_symbol_attrs */
> +static char __dummy_buf[KSYM_NAME_LEN];
> +
> +/* Check whether the address can be probed */
> +static int __kprobes can_probe(unsigned long paddr)

A better description would've been "Check if paddr is at an instruction
boundary". Otherwise looks good.
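
To illustrate what "paddr is at an instruction boundary" means, here is a
minimal sketch. It is not the code from the patch: toy_insn_length() is a
hypothetical stand-in for a real x86 instruction decoder, and
at_insn_boundary() is a name made up for this example. The idea shown is
only the walk itself: start from a known boundary (the function entry) and
hop instruction by instruction until the target is reached or overshot.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for an instruction decoder: returns the length in
 * bytes of the instruction starting at code[off].  In this toy encoding the
 * first byte of each "instruction" simply holds its own length, so the
 * example stays self-contained.  A real x86 decoder is far more involved.
 */
static size_t toy_insn_length(const unsigned char *code, size_t off)
{
    return code[off];
}

/*
 * Return 1 if target is the first byte of an instruction, 0 otherwise.
 * The walk starts at offset 0, which is assumed to be a boundary, and
 * advances one whole instruction at a time.
 */
static int at_insn_boundary(const unsigned char *code, size_t size,
                            size_t target)
{
    size_t off = 0;

    while (off < size && off < target) {
        size_t len = toy_insn_length(code, off);

        if (len == 0)
            return 0;   /* undecodable byte: refuse to probe */
        off += len;
    }
    /* On a boundary only if the walk landed exactly on target. */
    return off == target && target < size;
}
```

With instructions of length 2, 3 and 1 laid out back to back, offsets 0, 2
and 5 are boundaries and every other offset is rejected.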

Ananth
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Jeremy Fitzhardinge

Michael S. Tsirkin wrote:

> Rusty, I think this is what you did in your patch from 2008 to add destructor
> for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
> and it seems that it would make zero-copy possible - or was there some problem
> with that approach? Do you happen to remember?
  


I'm planning on resurrecting it to replace the page destructor used by 
Xen netback.


   J



[PATCH] kvm-userspace: fix option_rom_setup_reset address

2009-04-02 Thread Ryan Harper
Commit f2b690ba461971fb8b04354de8717a73fd08b945 changed the target
address for option roms, but failed to use the same address when
registering an option rom reset.  This manifests itself when using
extboot (boot=on) and resetting a guest via reboot or system_reset on
the monitor: the guest fails to boot.  This patch registers the correct
region for each option rom.


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


diffstat output:
 pc.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Signed-off-by: Ryan Harper 
---
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index d4a4320..a649ecf 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -985,7 +985,7 @@ vga_bios_error:
before INT 19h.  See the PNPBIOS specification, appendix B.
DDIM support is mandatory for proper PCI expansion ROM support. */
 cpu_register_physical_memory(offset, size, option_rom_offset /* | IO_MEM_ROM */);
-option_rom_setup_reset(0xd + offset, size);
+option_rom_setup_reset(offset, size);
 offset += size;
 }
 pci_option_rom_offset = offset;


Re: Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot

2009-04-02 Thread Sheng Yang
On Friday 03 April 2009 05:15:03 Ryan Harper wrote:
> * Gleb Natapov  [2009-04-01 09:54]:
> > On Wed, Apr 01, 2009 at 05:49:08PM +0300, Avi Kivity wrote:
> > > Gleb Natapov wrote:
> > >> Commit 3d28613c225ba94062950dacbb2304b2d2024abc break linux boot.
> > >> It hangs after printing:
> > >>  SMP alternatives: switching to UP code
> > >
> > > Does dropping bit 8 from context->rsvd_bits_mask[0][1]
> > > (PT64_ROOT_LEVEL) help?
> >
> > Yep.
>
> tip is still broken for me, did a fix go in for this?

Yes. The fix has already been picked up by Avi; please wait a while for the push.

Thanks.

-- 
regards
Yang, Sheng



RE: [PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-02 Thread Zhang, Xiantao
Avi Kivity wrote:
> Zhang, Yang wrote:
>> The data from DMA may include instructions. In order to execute
>> the right instructions, we should flush the i-cache to ensure that
>> the data can be seen by the CPU.
>> 
>> 
>> 
>> diff --git a/qemu/cache-utils.h b/qemu/cache-utils.h
>> index b45fde4..5e11d12 100644
>> --- a/qemu/cache-utils.h
>> +++ b/qemu/cache-utils.h
>> @@ -33,8 +33,22 @@ static inline void flush_icache_range(unsigned long start, unsigned long stop)
>>      asm volatile ("sync" : : : "memory");
>>      asm volatile ("isync" : : : "memory");
>>  }
>> +#define qemu_sync_idcache flush_icache_range
>> +#else
>> 
>> +#ifdef __ia64__
>> +static inline void qemu_sync_idcache(unsigned long start, unsigned long stop)
>> +{
>> +    while (start < stop) {
>> +        asm volatile ("fc %0" :: "r"(start));
>> +        start += 32;
>> +    }
>> +    asm volatile (";;sync.i;;srlz.i;;");
>> +}
>> 
> 
> What about smp?
> 
> I'm surprised the guest doesn't do this by itself?
> 
>> 
>>  void pstrcpy(char *buf, int buf_size, const char *str)
>> @@ -215,6 +216,8 @@ void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count)
>>          if (copy > qiov->iov[i].iov_len)
>>              copy = qiov->iov[i].iov_len;
>>          memcpy(qiov->iov[i].iov_base, p, copy);
>> +        qemu_sync_idcache((unsigned long)qiov->iov[i].iov_base,
>> +                    (unsigned long)(qiov->iov[i].iov_base + copy));
>>          p += copy;
>>          count -= copy;
>>      }
>> 
> 
> This is the wrong place to put this.  Once we stop bouncing
> scatter/gather DMA, we will no longer call this function.

This patch is intended to fix the issue before scatter/gather DMA mode is
adopted. But if we want to keep this function, we had better pick up the patch
to avoid such issues in the future.

> The correct place is either in the device code itself, or in the dma
> api (dma-helpers.c).

Maybe dma-helpers.c 
Xiantao


RE: [PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-02 Thread Zhang, Xiantao
Avi Kivity wrote:
> Avi Kivity wrote:
 
>>> 
>>> It doesn't have to do it.  The PCI transaction will automatically
>>> invalidate caches - but qemu doesn't emulate this (and doesn't need
>>> to on x86).
>>> 
>> 
>> So any DMA on ia64 will flush the instruction caches?!
>> 
> 
> Or maybe, the host kernel will do it after the transaction completes?

The host kernel doesn't do any cache flushing after DMA, since it assumes the
platform guarantees that.

> In our case the lack of zero-copy means the host is invalidating the
> wrong addresses (memcpy source) and leaving the real destination
> intact. 

We just need to sync the target address (destination address), because only its
physical address belongs to the guest and is likely to be the guest's DMA
target address.

Xiantao


RE: [PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-02 Thread Zhang, Xiantao
Avi Kivity wrote:
> tging...@free.fr wrote:
>> 
>>> What about smp?
>>> 
>> 
>> fc will broadcast to the coherence domain the cache invalidation. 
>> So it is SMP-ready for usual machines. 
>> 
>> 
> 
> Interesting.
> 
>>> I'm surprised the guest doesn't do this by itself?
>>> 
>> 
>> It doesn't have to do it.  The PCI transaction will automatically
>> invalidate caches - but qemu doesn't emulate this (and doesn't need
>> to on x86).
>> 
> 
> So any DMA on ia64 will flush the instruction caches?

Yes, physical DMA should do this, but virtual DMA operations emulated by Qemu
should use explicit instructions (fc, sync.i) to make it happen, because the
data transferred by virtual DMA may be used as an instruction stream by the guest.
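
As a rough, portable illustration of "copy, then sync the destination", the
sketch below pairs memcpy() with the GCC/Clang builtin
__builtin___clear_cache(). This is an assumption made for the example, not
what the patch does: the patch issues fc/sync.i/srlz.i directly, and
copy_and_sync_icache() is a hypothetical helper name. On x86 the builtin
typically compiles to nothing, which matches the remark that qemu does not
need this there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy len bytes to dst, then ask the compiler to make the instruction
 * cache coherent with the freshly written range.  Only the destination
 * range is synced, since that is the memory the guest may later execute.
 */
static void copy_and_sync_icache(void *dst, const void *src, size_t len)
{
    char *begin = dst;

    memcpy(dst, src, len);
    /* GCC/Clang builtin; the range is [begin, begin + len). */
    __builtin___clear_cache(begin, begin + len);
}
```
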
Xiantao


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
Anthony Liguori  wrote:
>
> Anyway, if we're able to send this many packets, I suspect we'll be able 
> to also handle much higher throughputs without TX mitigation so that's 
> what I'm going to look at now.

Awesome! I'm prepared to eat my words :)

On the subject of TX mitigation, can we please set a standard
on how we measure it? For instance, do we bind the backend
qemu to the same CPU as the guest, or do we bind it to a different
CPU that shares cache? They're two completely different scenarios
and I think we should be explicit about which one we're measuring.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH -tip 2/6 V4] x86: add arch-dep register and stack access API to ptrace

2009-04-02 Thread Masami Hiramatsu
Frederic Weisbecker wrote:
> On Thu, Apr 02, 2009 at 01:24:47PM -0400, Masami Hiramatsu wrote:
>> Add following APIs for accessing registers and stack entries from pt_regs.
>> - query_register_offset(const char *name)
>>Query the offset of "name" register.
>>
>> - query_register_name(unsigned offset)
>>Query the name of register by its offset.
>>
>> - get_register(struct pt_regs *regs, unsigned offset)
>>Get the value of a register by its offset.
>>
>> - valid_stack_address(struct pt_regs *regs, unsigned long addr)
>>Check the address is in the stack.
>>
>> - get_stack_nth(struct pt_regs *reg, unsigned nth)
>>Get Nth entry of the stack. (N >= 0)
>>
>> - get_argument_nth(struct pt_regs *reg, unsigned nth)
>>Get Nth argument at function call. (N >= 0)
>>
>> Signed-off-by: Masami Hiramatsu 
>> Cc: Steven Rostedt 
>> Cc: Ananth N Mavinakayanahalli 
>> Cc: Ingo Molnar 
>> Cc: Frederic Weisbecker 
>> ---
>>
>>  arch/x86/include/asm/ptrace.h |   66 +
>>  arch/x86/kernel/ptrace.c  |   59 +
>>  2 files changed, 125 insertions(+), 0 deletions(-)
>>
>>
>> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>> index aed0894..44773b8 100644
>> --- a/arch/x86/include/asm/ptrace.h
>> +++ b/arch/x86/include/asm/ptrace.h
>> @@ -7,6 +7,7 @@
>>
>>  #ifdef __KERNEL__
>>  #include 
>> +#include 
>>  #endif
>>
>>  #ifndef __ASSEMBLY__
>> @@ -215,6 +216,71 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs)
>>  return regs->sp;
>>  }
>>
>> +/* Query offset/name of register from its name/offset */
>> +extern int query_register_offset(const char *name);
>> +extern const char *query_register_name(unsigned offset);
>> +#define MAX_REG_OFFSET (offsetof(struct pt_regs, sp))
>> +
>> +/* Get register value from its offset */
>> +static inline unsigned long get_register(struct pt_regs *regs, unsigned offset)
>> +{
>> +if (unlikely(offset > MAX_REG_OFFSET))
>> +return 0;
>> +return *(unsigned long *)((unsigned long)regs + offset);
>> +}
>> +
>> +/* Check the address in the stack */
>> +static inline int valid_stack_address(struct pt_regs *regs, unsigned long addr)
>> +{
>> +return ((addr & ~(THREAD_SIZE - 1))  ==
>> +(kernel_trap_sp(regs) & ~(THREAD_SIZE - 1)));
>> +}
>> +
>> +/* Get Nth entry of the stack */
>> +static inline unsigned long get_stack_nth(struct pt_regs *regs, unsigned n)
>> +{
>> +unsigned long *addr = (unsigned long *)kernel_trap_sp(regs);
>> +addr += n;
>> +if (valid_stack_address(regs, (unsigned long)addr))
>> +return *addr;
>> +else
>> +return 0;
>> +}
>> +
>> +/* Get Nth argument at function call */
>> +static inline unsigned long get_argument_nth(struct pt_regs *regs, unsigned n)
>> +{
>> +#ifdef CONFIG_X86_32
>> +#define NR_REGPARMS 3
>> +if (n < NR_REGPARMS) {
>> +switch (n) {
>> +case 0: return regs->ax;
>> +case 1: return regs->dx;
>> +case 2: return regs->cx;
>> +}
>> +return 0;
>> +#else /* CONFIG_X86_64 */
>> +#define NR_REGPARMS 6
>> +if (n < NR_REGPARMS) {
>> +switch (n) {
>> +case 0: return regs->di;
>> +case 1: return regs->si;
>> +case 2: return regs->dx;
>> +case 3: return regs->cx;
>> +case 4: return regs->r8;
>> +case 5: return regs->r9;
>> +}
>> +return 0;
>> +#endif
>> +} else {
>> +/*
>> + * The typical case: arg n is on the stack.
>> + * (Note: stack[0] = return address, so skip it)
>> + */
>> +return get_stack_nth(regs, 1 + n - NR_REGPARMS);
>> +}
>> +}
>> +
>>  /*
>>   * These are defined as per linux/ptrace.h, which see.
>>   */
>> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>> index 5c6e463..8d65dcb 100644
>> --- a/arch/x86/kernel/ptrace.c
>> +++ b/arch/x86/kernel/ptrace.c
>> @@ -46,6 +46,65 @@ enum x86_regset {
>>  REGSET_IOPERM32,
>>  };
>>
>> +struct pt_regs_offset {
>> +const char *name;
>> +int offset;
>> +};
>> +
>> +#define REG_OFFSET(r) offsetof(struct pt_regs, r)
>> +#define REG_OFFSET_NAME(r) {.name = #r, .offset = REG_OFFSET(r)}
>> +#define REG_OFFSET_END {.name = NULL, .offset = 0}
>> +
>> +static struct pt_regs_offset regoffset_table[] = {
>> +#ifdef CONFIG_X86_64
>> +REG_OFFSET_NAME(r15),
>> +REG_OFFSET_NAME(r14),
>> +REG_OFFSET_NAME(r13),
>> +REG_OFFSET_NAME(r12),
>> +REG_OFFSET_NAME(r11),
>> +REG_OFFSET_NAME(r10),
>> +REG_OFFSET_NAME(r9),
>> +REG_OFFSET_NAME(r8),
>> +#endif
>> +REG_OFFSET_NAME(bx),
>> +REG_OFFSET_NAME(cx),
>> +REG_OFFSET_NAME(dx),
>> +REG_OFFSET_NAME(si),
>> +REG_OFFSET_NAME(di),
>> +REG_OFFSET_NAME(bp),
>> +REG_OFFSET_NAME(ax),
>> +#ifdef CONFIG_X86_32
>> +REG_OFFSET_NAME

[PATCH] kvm-autotest: add object addressing in sample cfg

2009-04-02 Thread Ryan Harper
The wiki documents[1] object addressing quite well, but we should
include it in the example config file as well.

1. http://www.linux-kvm.org/page/KVM-Autotest/Parameters#Addressing_objects_.28VMs.2C_images.2C_NICs_etc.29


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


diffstat output:
 kvm_tests.cfg.sample |4 
 1 files changed, 4 insertions(+)

Signed-off-by: Ryan Harper 
---
diff --git a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
index 5619fa8..64f8e4b 100644
--- a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
+++ b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
@@ -19,6 +19,10 @@ image_size = 10G
 ssh_port = 22
 display = vnc
 
+# specify specific values for vm1 and nic1
+mem_vm1 = 256
+nic_model_nic1 = rtl8139
+
 # Port redirections
 redirs = ssh
 guest_port_ssh = 22


[PATCH] kvm-autotest: kvm_vm.py get values from nic_params dict

2009-04-02 Thread Ryan Harper
Looks like a cut-n-paste error.  We should be fetching values from
nic_params, not image_params.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


diffstat output:
 kvm_vm.py |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Signed-off-by: Ryan Harper 
---
diff --git a/client/tests/kvm_runtest_2/kvm_vm.py b/client/tests/kvm_runtest_2/kvm_vm.py
index 2e6599e..5136a26 100644
--- a/client/tests/kvm_runtest_2/kvm_vm.py
+++ b/client/tests/kvm_runtest_2/kvm_vm.py
@@ -189,8 +189,8 @@ class VM:
         for nic_name in kvm_utils.get_sub_dict_names(params, "nics"):
             nic_params = kvm_utils.get_sub_dict(params, nic_name)
             qemu_cmd += " -net nic,vlan=%d" % vlan
-            if image_params.get("nic_model"):
-                qemu_cmd += ",model=%s" % image_params.get("nic_model")
+            if nic_params.get("nic_model"):
+                qemu_cmd += ",model=%s" % nic_params.get("nic_model")
             qemu_cmd += " -net user,vlan=%d" % vlan
             vlan += 1
 


Re: [PATCH -tip 2/6 V4] x86: add arch-dep register and stack access API to ptrace

2009-04-02 Thread Frederic Weisbecker
On Thu, Apr 02, 2009 at 01:24:47PM -0400, Masami Hiramatsu wrote:
> Add following APIs for accessing registers and stack entries from pt_regs.
> - query_register_offset(const char *name)
>Query the offset of "name" register.
> 
> - query_register_name(unsigned offset)
>Query the name of register by its offset.
> 
> - get_register(struct pt_regs *regs, unsigned offset)
>Get the value of a register by its offset.
> 
> - valid_stack_address(struct pt_regs *regs, unsigned long addr)
>Check the address is in the stack.
> 
> - get_stack_nth(struct pt_regs *reg, unsigned nth)
>Get Nth entry of the stack. (N >= 0)
> 
> - get_argument_nth(struct pt_regs *reg, unsigned nth)
>Get Nth argument at function call. (N >= 0)
> 
> Signed-off-by: Masami Hiramatsu 
> Cc: Steven Rostedt 
> Cc: Ananth N Mavinakayanahalli 
> Cc: Ingo Molnar 
> Cc: Frederic Weisbecker 
> ---
> 
>  arch/x86/include/asm/ptrace.h |   66 +
>  arch/x86/kernel/ptrace.c  |   59 +
>  2 files changed, 125 insertions(+), 0 deletions(-)
> 
> 
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index aed0894..44773b8 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -7,6 +7,7 @@
> 
>  #ifdef __KERNEL__
>  #include 
> +#include 
>  #endif
> 
>  #ifndef __ASSEMBLY__
> @@ -215,6 +216,71 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs)
>   return regs->sp;
>  }
> 
> +/* Query offset/name of register from its name/offset */
> +extern int query_register_offset(const char *name);
> +extern const char *query_register_name(unsigned offset);
> +#define MAX_REG_OFFSET (offsetof(struct pt_regs, sp))
> +
> +/* Get register value from its offset */
> +static inline unsigned long get_register(struct pt_regs *regs, unsigned offset)
> +{
> + if (unlikely(offset > MAX_REG_OFFSET))
> + return 0;
> + return *(unsigned long *)((unsigned long)regs + offset);
> +}
> +
> +/* Check the address in the stack */
> +static inline int valid_stack_address(struct pt_regs *regs, unsigned long addr)
> +{
> + return ((addr & ~(THREAD_SIZE - 1))  ==
> + (kernel_trap_sp(regs) & ~(THREAD_SIZE - 1)));
> +}
> +
> +/* Get Nth entry of the stack */
> +static inline unsigned long get_stack_nth(struct pt_regs *regs, unsigned n)
> +{
> + unsigned long *addr = (unsigned long *)kernel_trap_sp(regs);
> + addr += n;
> + if (valid_stack_address(regs, (unsigned long)addr))
> + return *addr;
> + else
> + return 0;
> +}
> +
> +/* Get Nth argument at function call */
> +static inline unsigned long get_argument_nth(struct pt_regs *regs, unsigned n)
> +{
> +#ifdef CONFIG_X86_32
> +#define NR_REGPARMS 3
> + if (n < NR_REGPARMS) {
> + switch (n) {
> + case 0: return regs->ax;
> + case 1: return regs->dx;
> + case 2: return regs->cx;
> + }
> + return 0;
> +#else /* CONFIG_X86_64 */
> +#define NR_REGPARMS 6
> + if (n < NR_REGPARMS) {
> + switch (n) {
> + case 0: return regs->di;
> + case 1: return regs->si;
> + case 2: return regs->dx;
> + case 3: return regs->cx;
> + case 4: return regs->r8;
> + case 5: return regs->r9;
> + }
> + return 0;
> +#endif
> + } else {
> + /*
> +  * The typical case: arg n is on the stack.
> +  * (Note: stack[0] = return address, so skip it)
> +  */
> + return get_stack_nth(regs, 1 + n - NR_REGPARMS);
> + }
> +}
> +
>  /*
>   * These are defined as per linux/ptrace.h, which see.
>   */
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 5c6e463..8d65dcb 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -46,6 +46,65 @@ enum x86_regset {
>   REGSET_IOPERM32,
>  };
> 
> +struct pt_regs_offset {
> + const char *name;
> + int offset;
> +};
> +
> +#define REG_OFFSET(r) offsetof(struct pt_regs, r)
> +#define REG_OFFSET_NAME(r) {.name = #r, .offset = REG_OFFSET(r)}
> +#define REG_OFFSET_END {.name = NULL, .offset = 0}
> +
> +static struct pt_regs_offset regoffset_table[] = {
> +#ifdef CONFIG_X86_64
> + REG_OFFSET_NAME(r15),
> + REG_OFFSET_NAME(r14),
> + REG_OFFSET_NAME(r13),
> + REG_OFFSET_NAME(r12),
> + REG_OFFSET_NAME(r11),
> + REG_OFFSET_NAME(r10),
> + REG_OFFSET_NAME(r9),
> + REG_OFFSET_NAME(r8),
> +#endif
> + REG_OFFSET_NAME(bx),
> + REG_OFFSET_NAME(cx),
> + REG_OFFSET_NAME(dx),
> + REG_OFFSET_NAME(si),
> + REG_OFFSET_NAME(di),
> + REG_OFFSET_NAME(bp),
> + REG_OFFSET_NAME(ax),
> +#ifdef CONFIG_X86_32
> + REG_OFFSET_NAME(ds),
> + REG_OFFSET_NAME(es),
> + REG_OFFSET_NAME(fs),
> + REG_OFFSET_NAME(gs),
> +#endif
> + REG_OFF
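
For readers who want to try the name/offset lookup idea outside the kernel,
here is a small user-space sketch. Everything in it is hypothetical: struct
toy_regs is a cut-down stand-in for struct pt_regs, and the toy_* functions
only mirror the shape of query_register_offset(), query_register_name() and
get_register() described in the patch. The alignment check in
toy_get_register() is an addition made for this example and is not in the
quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down register frame; not the real struct pt_regs. */
struct toy_regs {
    unsigned long bx, cx, dx, si, di, bp, ax, ip, sp;
};

struct reg_offset {
    const char *name;
    int offset;
};

/* Build one table entry from a field name, as the patch does with offsetof. */
#define TOY_REG(r) { #r, (int)offsetof(struct toy_regs, r) }
#define TOY_MAX_OFFSET offsetof(struct toy_regs, sp)

static const struct reg_offset toy_table[] = {
    TOY_REG(bx), TOY_REG(cx), TOY_REG(dx), TOY_REG(si), TOY_REG(di),
    TOY_REG(bp), TOY_REG(ax), TOY_REG(ip), TOY_REG(sp),
    { NULL, 0 },    /* terminator */
};

/* Name -> byte offset, or -1 if the name is unknown. */
static int toy_query_offset(const char *name)
{
    const struct reg_offset *r;

    for (r = toy_table; r->name; r++)
        if (strcmp(r->name, name) == 0)
            return r->offset;
    return -1;
}

/* Byte offset -> name, or NULL if no register starts at that offset. */
static const char *toy_query_name(unsigned offset)
{
    const struct reg_offset *r;

    for (r = toy_table; r->name; r++)
        if ((unsigned)r->offset == offset)
            return r->name;
    return NULL;
}

/* Read a register by byte offset; bad offsets read as 0. */
static unsigned long toy_get_register(const struct toy_regs *regs,
                                      unsigned offset)
{
    if (offset > TOY_MAX_OFFSET || offset % sizeof(unsigned long))
        return 0;
    return *(const unsigned long *)((const char *)regs + offset);
}
```

One design point the sketch makes visible: returning 0 for a bad offset is
indistinguishable from a register that really holds 0, so a caller that
cares has to validate the offset separately.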

Re: CPU Limits on KVM?

2009-04-02 Thread Brian Jackson
I haven't ever really used cgroups. I always figured a fair host scheduler is
good enough to handle spreading load. So I don't know if it will fit exactly
what you need. I don't think so. I also don't know of any other options. I
will say, if I gave 4 VMs a single cpu each on a 4 core host, I would expect
the host to be fully loaded. I wouldn't see any reason for the host not to be
fully loaded. That is after all one of the key points of virtualization:
better utilization of hardware.



On Thursday 02 April 2009 17:33:07 Francisco Mazzeo wrote:
> Hello Brian,
>
>  Thanks for the reply. is there a wiki about cgroupds and how to set them
> up?
>
>  Also, I tried just for kicks to see what would happen if I create 4
> Virtual Windows machines, run prime95 (a tool that does iterations
> like superpi to stress test memory/cpu) on all of them and just assign
> them only ONE core to them.
>
>  The server node did not crash and you are right, however I was hoping
> for the server load to stay below 50% as I only gave it one single
> core to each KVM VE. Instead it seems like KVM let each VE get one
> slice of each of the 4 cores of my CPU, which did not accomplish what
> I wanted.
>
>  Is cgroupds the only choice available?
>
> -- Francisco
>
> On Thu, Apr 2, 2009 at 3:29 PM, Brian Jackson  wrote:
> > There's CPU cgroups. It doesn't have exactly the ability you are after,
> > but it is able to limit process(es) CPU usage. Maxing out CPU usage won't
> > crash your server. The kernel will arbitrate sharing the CPU evenly among
> > processes/VMs.
> >
> > --Brian Jackson
> >
> > On Thursday 02 April 2009 16:41:10 Francisco Mazzeo wrote:
> >> Hello,
> >>
> >>  I am a new user to KVM and was wondering if there was any way to
> >> limit a VE from using up all the resources of the processor.
> >>
> >>
> >>  Right now I have a Quad core 2.5Ghz, I have a KVM VE (running windows
> >> server 2003) and assigned 4 CPUs to it. If I max out the load for that
> >> VE, the entire host node load will be 100% which may crash it if I
> >> hosted more than 1 single VE.
> >>
> >>  OpenVZ has cpulimit command, does KVM have something similar or any
> >> way that I can implement a limit on a single VE? Say I want to only
> >> give a max of 500Mhz per core, to total 2Ghz to the VE.
> >>
> >> Thanks
> >> Francisco
> >> www.navigatoris.net / www.serversoutlet.com



Re: CPU Limits on KVM?

2009-04-02 Thread Brian Jackson
There's CPU cgroups. It doesn't have exactly the ability you are after, but it 
is able to limit process(es) CPU usage. Maxing out CPU usage won't crash your 
server. The kernel will arbitrate sharing the CPU evenly among processes/VMs.

--Brian Jackson


On Thursday 02 April 2009 16:41:10 Francisco Mazzeo wrote:
> Hello,
>
>  I am a new user to KVM and was wondering if there was any way to
> limit a VE from using up all the resources of the processor.
>
>
>  Right now I have a Quad core 2.5Ghz, I have a KVM VE (running windows
> server 2003) and assigned 4 CPUs to it. If I max out the load for that
> VE, the entire host node load will be 100% which may crash it if I
> hosted more than 1 single VE.
>
>  OpenVZ has cpulimit command, does KVM have something similar or any
> way that I can implement a limit on a single VE? Say I want to only
> give a max of 500Mhz per core, to total 2Ghz to the VE.
>
> Thanks
> Francisco
> www.navigatoris.net / www.serversoutlet.com



Re: CPU Limits on KVM?

2009-04-02 Thread Francisco Mazzeo
Hello,

 I am a new user to KVM and was wondering if there was any way to
limit a VE from using up all the resources of the processor.


 Right now I have a quad-core 2.5GHz CPU. I have a KVM VE (running Windows
Server 2003) and assigned 4 CPUs to it. If I max out the load for that
VE, the entire host node load will be 100%, which may crash it if I
hosted more than one VE.

 OpenVZ has a cpulimit command; does KVM have something similar, or any
way that I can implement a limit on a single VE? Say I want to only
give a max of 500MHz per core, for a total of 2GHz to the VE.

Thanks
Francisco
www.navigatoris.net / www.serversoutlet.com


Re: Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot

2009-04-02 Thread Ryan Harper
* Gleb Natapov  [2009-04-01 09:54]:
> On Wed, Apr 01, 2009 at 05:49:08PM +0300, Avi Kivity wrote:
> > Gleb Natapov wrote:
> >> Commit 3d28613c225ba94062950dacbb2304b2d2024abc break linux boot.
> >> It hangs after printing:
> >>  SMP alternatives: switching to UP code
> >>   
> >
> > Does dropping bit 8 from context->rsvd_bits_mask[0][1] (PT64_ROOT_LEVEL)  
> > help?
> >
> Yep.

tip is still broken for me, did a fix go in for this?

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-04-02 Thread Jesper Juhl
On Thu, 2 Apr 2009, Chris Wright wrote:

> * Jesper Juhl (j...@chaosbits.net) wrote:
> > Do you rely only on the checksum or do you actually compare pages to check 
> > they are 100% identical before sharing?
> 
> Checksum has absolutely nothing to do w/ finding if two pages match.
> It's only used as a heuristic to suggest whether a single page has
> changed.  If that page is changing we won't bother trying to find a
> match for it.  Here's an example of the life of a page w.r.t checksum.
> 
> 1. checksum = uninitialized
> 2. first time page is found, checksum it (checksum = A).
>    if checksum has changed (uninitialized != A) don't go any further w/ that page
> 3. next time page is found, checksum it (checksum = B).
>    if checksum has changed (A != B) don't go any further w/ that page
> 4. next time page is found, checksum it (checksum = B).
>if checksum has changed (B == B)...it hasn't, continue processing the
>page
> 
> later if a match is found in the tree (which is sorted by _contents_,
> i.e. memcmp) we'll attempt to merge the pages which at its very core
> does:
> 
>   if (pages_identical(oldpage, newpage))
>   ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
> 
> pages_identical?  you guessed it...just does:
> 
>   r = memcmp(addr1, addr2, PAGE_SIZE)
> 

Thank you for that explanation, it set my mind at ease :-)


-- 
Jesper Juhl  http://www.chaosbits.net/
Plain text mails only, please  http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html



Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-04-02 Thread Chris Wright
* Jesper Juhl (j...@chaosbits.net) wrote:
> Do you rely only on the checksum or do you actually compare pages to check 
> they are 100% identical before sharing?

Checksum has absolutely nothing to do w/ finding if two pages match.
It's only used as a heuristic to suggest whether a single page has
changed.  If that page is changing we won't bother trying to find a
match for it.  Here's an example of the life of a page w.r.t checksum.

1. checksum = uninitialized
2. first time page is found, checksum it (checksum = A).
   if checksum has changed (uninitialized != A) don't go any further w/ that page
3. next time page is found, checksum it (checksum = B).
   if checksum has changed (A != B) don't go any further w/ that page
4. next time page is found, checksum it (checksum = B).
   if checksum has changed (B == B)...it hasn't, continue processing the
   page

later if a match is found in the tree (which is sorted by _contents_,
i.e. memcmp) we'll attempt to merge the pages which at its very core
does:

if (pages_identical(oldpage, newpage))
ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);

pages_identical?  you guessed it...just does:

r = memcmp(addr1, addr2, PAGE_SIZE)
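
The two roles described above (checksum as a "has this page settled down?"
heuristic, memcmp as the only test of equality) can be sketched in user
space as follows. This is an illustration under stated assumptions, not the
ksm code: struct toy_page, toy_page_is_stable() and toy_pages_identical()
are hypothetical names, and FNV-1a is used here only as a placeholder hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    uint32_t last_sum;  /* checksum recorded on the previous scan */
    int have_sum;       /* 0 until the page has been scanned once */
};

/* Placeholder checksum (FNV-1a); the choice of hash is an assumption. */
static uint32_t toy_checksum(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/*
 * Return 1 if the checksum matches the one from the previous scan, i.e.
 * the page looks stable enough to be worth searching a match for.
 * The new checksum is always recorded for the next scan.
 */
static int toy_page_is_stable(struct toy_page *pg)
{
    uint32_t sum = toy_checksum(pg->data, TOY_PAGE_SIZE);
    int stable = pg->have_sum && pg->last_sum == sum;

    pg->last_sum = sum;
    pg->have_sum = 1;
    return stable;
}

/* Equality is decided by a full byte-for-byte comparison, never the hash. */
static int toy_pages_identical(const struct toy_page *a,
                               const struct toy_page *b)
{
    return memcmp(a->data, b->data, TOY_PAGE_SIZE) == 0;
}
```

A checksum collision in this scheme can at worst make a changing page look
stable for one scan; it cannot cause two different pages to be merged.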

thanks,
-chris


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-04-02 Thread Izik Eidus

Jesper Juhl wrote:
> Hi,
>
> On Tue, 31 Mar 2009, Izik Eidus wrote:
>
>> KSM is a Linux driver that allows dynamically sharing identical memory
>> pages between one or more processes.
>>
>> Unlike traditional page sharing, which is set up when the memory is
>> allocated, KSM does it dynamically after the memory has been created.
>> Memory is periodically scanned; identical pages are identified and
>> merged.
>> The sharing is unnoticeable by the processes that use this memory.
>> (The shared pages are marked read-only, and in case of a write
>> do_wp_page() takes care of creating a new copy of the page.)
>>
>> To find identical pages, KSM uses an algorithm that is split into three
>> primary levels:
>>
>> 1) KSM starts scanning the memory and calculates a checksum for each
>>    page that is registered to be scanned.
>>    (In the first round of scanning, KSM only calculates
>>    this checksum for all the pages.)
>
> One question;
>
> Calculating a checksum is a fine way to find pages that are "likely to be
> identical"

I don't use the checksum the way a hash table would: the checksum is not
used to find identical pages by matching pages that have similar data.
The checksum is used to let me know that a page has not changed for a
while, so it is worth checking for pages identical to it.
In the future we will want to use the page table dirty bit for this, as
taking a checksum is somewhat expensive.

> , but there is no guarantee that two pages with the same
> checksum really are identical - there *will* be checksum collisions
> eventually. So, I really hope that your implementation actually checks
> that two pages that it finds that have identical checksums really are 100%
> identical by comparing them bit by bit before throwing one away.

We do that :-)

> If you rely only on a checksum then eventually a user will get bitten by a
> checksum collision and, in the best case, something will crash, and in the
> worst case, data will silently be corrupted.
>
> Do you rely only on the checksum or do you actually compare pages to check
> they are 100% identical before sharing?

I do a 100% compare of the pages before I share them.

> I must admit that I have not read through the patch to find the answer, I
> just read your description and became concerned.

Don't worry, me neither :-)
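
For readers following the thread, here is a minimal userspace sketch of the
two roles described above: the checksum only answers "has this page changed
since the last scan?", and a full byte-for-byte compare decides whether two
pages may be shared. Everything in it (struct scanned_page, the FNV-1a
checksum, the function names, SKETCH_PAGE_SIZE) is an assumption made for
illustration and is not taken from the KSM patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical per-page bookkeeping; not a structure from the KSM patch. */
struct scanned_page {
	const uint8_t *data;    /* SKETCH_PAGE_SIZE bytes of page contents */
	uint32_t last_checksum; /* checksum recorded by the previous scan */
	bool seen;              /* false until the page was scanned once */
};

/* Stand-in checksum (FNV-1a); it is only used to notice that a page changed. */
static uint32_t page_checksum(const uint8_t *data)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Record the current checksum; report whether it matches the previous scan. */
static bool page_is_stable(struct scanned_page *p)
{
	uint32_t sum = page_checksum(p->data);
	bool stable = p->seen && sum == p->last_checksum;

	p->last_checksum = sum;
	p->seen = true;
	return stable;
}

/* Pages are merge candidates only if both are stable AND byte-identical. */
static bool pages_can_merge(struct scanned_page *a, struct scanned_page *b)
{
	bool a_stable = page_is_stable(a);
	bool b_stable = page_is_stable(b);

	if (!a_stable || !b_stable)
		return false;
	return memcmp(a->data, b->data, SKETCH_PAGE_SIZE) == 0;
}
```

In this sketch the first scan of any page never merges it, because there is
no earlier checksum to compare against; that mirrors the "first round only
calculates the checksum" step in the quoted description.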



Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-04-02 Thread Jesper Juhl
Hi,

On Tue, 31 Mar 2009, Izik Eidus wrote:

> KSM is a Linux driver that allows dynamically sharing identical memory
> pages between one or more processes.
> 
> Unlike traditional page sharing, which is set up when the memory is
> allocated, KSM does it dynamically after the memory has been created.
> Memory is periodically scanned; identical pages are identified and
> merged.
> The sharing is unnoticeable by the processes that use this memory.
> (The shared pages are marked read-only, and in case of a write
> do_wp_page() takes care of creating a new copy of the page.)
> 
> To find identical pages, KSM uses an algorithm that is split into three
> primary levels:
> 
> 1) KSM starts scanning the memory and calculates a checksum for each
>    page that is registered to be scanned.
>    (In the first round of scanning, KSM only calculates
>    this checksum for all the pages.)
> 

One question;

Calculating a checksum is a fine way to find pages that are "likely to be 
identical", but there is no guarantee that two pages with the same 
checksum really are identical - there *will* be checksum collisions 
eventually. So, I really hope that your implementation actually checks 
that two pages that it finds that have identical checksums really are 100% 
identical by comparing them bit by bit before throwing one away.
If you rely only on a checksum then eventually a user will get bitten by a 
checksum collision and, in the best case, something will crash, and in the 
worst case, data will silently be corrupted.

Do you rely only on the checksum or do you actually compare pages to check 
they are 100% identical before sharing?

I must admit that I have not read through the patch to find the answer, I 
just read your description and became concerned.

-- 
Jesper Juhl  http://www.chaosbits.net/
Plain text mails only, please  http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html



[PATCH] Update .gitignore

2009-04-02 Thread Jan Kiszka
Signed-off-by: Jan Kiszka 
---

 .gitignore |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/.gitignore b/.gitignore
index fcdc357..22a8200 100644
--- a/.gitignore
+++ b/.gitignore
@@ -53,10 +53,14 @@ kernel/x86/coalesced_mmio.[ch]
 kernel/x86/kvm_cache_regs.h
 kernel/x86/vtd.c
 kernel/x86/irq_comm.c
+kernel/x86/timer.c
+kernel/x86/kvm_timer.h
+kernel/x86/iommu.c
 qemu/pc-bios/extboot.bin
 qemu/qemu-doc.html
 qemu/*.[18]
 qemu/*.pod
 qemu/qemu-tech.html
+qemu/qemu-options.texi
 user/kvmtrace
 user/test/x86/bootstrap


[PATCH] extboot: silence compiler warning

2009-04-02 Thread Jan Kiszka
Signed-off-by: Jan Kiszka 
---

 qemu/hw/extboot.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu/hw/extboot.c b/qemu/hw/extboot.c
index 32e6226..13ffafa 100644
--- a/qemu/hw/extboot.c
+++ b/qemu/hw/extboot.c
@@ -77,8 +77,8 @@ static void extboot_write_cmd(void *opaque, uint32_t addr, uint32_t value)
 BlockDriverState *bs = opaque;
 int cylinders, heads, sectors, err;
 uint64_t nb_sectors;
-target_phys_addr_t pa;
-int blen;
+target_phys_addr_t pa = 0;
+int blen = 0;
 void *buf = NULL;
 
 if (cmd->type == 0x01 || cmd->type == 0x02) {


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Anthony Liguori

Avi Kivity wrote:
> Anthony Liguori wrote:
>> I don't think we even need that to end this debate.  I'm convinced
>> we have a bug somewhere.  Even disabling TX mitigation, I see a ping
>> latency of around 300ns whereas it's only 50ns on the host.  This
>> defies logic so I'm now looking to isolate why that is.
>>
>> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes
>> were the big winner... I hate qemu sometimes.
>
> What, this:


The UDP_RR test was limited by CPU consumption.  QEMU was pegging a CPU with 
only about 4000 packets per second whereas the host could do 14000.  An 
oprofile run showed that phys_page_find/cpu_physical_memory_rw were at 
the top by a wide margin, which makes little sense since virtio is zero 
copy in kvm-userspace today.


That leaves the ring queue accessors that used ld[wlq]_phys and friends, 
which happen to make use of the above.  That led me to try this terrible 
hack below and, lo and behold, we immediately jumped to 1 pps.  This 
only works because almost nothing uses ld[wlq]_phys in practice except 
for virtio, so breaking it for the non-RAM case didn't matter.


We didn't encounter this before because when I changed this behavior, I 
tested streaming and ping.  Both remained the same.  You can only expose 
this issue if you first disable tx mitigation.


Anyway, if we're able to send this many packets, I suspect we'll be able 
to also handle much higher throughputs without TX mitigation so that's 
what I'm going to look at now.


Regards,

Anthony Liguori


Re: [RFC PATCH 02/17] vbus: add virtual-bus definitions

2009-04-02 Thread Gregory Haskins
Hi Ben

Ben Hutchings wrote:
> On Tue, 2009-03-31 at 14:42 -0400, Gregory Haskins wrote:
> [...]
>   
>> +Create a device instance
>> +
>> +
>> +Devices are instantiated by again utilizing the /config/vbus configfs area.
>> +At first you may suspect that devices are created as subordinate objects of a
>> +bus/container instance, but you would be mistaken.
>> 
>
> This is kind of patronising; why don't you simply lay out how things
> _do_ work?
>   

Ya, point taken.  I think that was written really to myself, because my
first design *had* the device as a subordinate object.  Then I realized
later that I didn't like that design :)

I will fix this.

>   
>>  Devices are actually
>> +root-level objects in vbus specifically to allow greater flexibility in the
>> +association of a device.  For instance, it may be desirable to have a single
>> +device that spans multiple VMs (consider an ethernet switch, or a shared disk
>> +for a cluster).  Therefore, device lifecycles are managed by creating/deleting
>> +objects in /config/vbus/devices.
>> +
>> +Note: Creating a device instance is actually a two step process:  We need to
>> +give the device instance a unique name, and we also need to give it a specific
>> +device type.  It is hard to express both parameters using standard filesystem
>> +operations like mkdir, so the design decision was made to require performing
>> +the operation in two steps.
>> 
>
> How about exposing a subdir for each device class under
> /config/vbus/devices/ and allowing device creation only within those?
> Two-stage construction is a pain for both users and implementors.
>
>   
I am not sure I follow.  It sounds like you are suggesting exactly what
I do today.

> [...]
>   
>> +At this point, we are ready to roll.  Pid 4382 has access to a virtual-bus
>> +namespace with one device, id=0.  Its type is:
>> +
>> +# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
>> +virtual-ethernet
>> +
>> +"virtual-ethernet"?  Why is it not "venet-tap"?  Device-classes are allowed to
>> 

I think I worded this awkwardly.  A device-class creates a
device-instance.  A device-instance registers one or more interfaces. 
There are device types (of which I would classify both the device-class
and its instantiated device object as the same "type"), and there are
interface types.  The interface types may overlap across different
device types, as demonstrated below.  I will update the doc to be more
clear, here (assuming I didn't muddle it up even more ;)

>> +register their interfaces under an id that is not required to be the same as
>> +their deviceclass.  This supports device polymorphism.   For instance,
>> +consider that an interface "virtual-ethernet" may provide basic 802.x packet
>> +exchange.  However, we could have various implementations of a device that
>> +supports the 802.x interface, while having various implementations behind
>> +them.
>> 
> [...]
>
> It seems to me that your "device-classes" correspond to drivers and
> "interfaces" correspond to device classes in the LDM.

I don't think that is quite right, but I might be missing your point. 
All of these objects exist on the "backend", for which there isn't a
specific precedent in the LDM.  Normally in LDM, you would have
some kind of physical device object in the hardware (say a SATA disk),
and an LDM "block device" that represents it in software.  So we call
the LDM model for that disk a "device", but really it's like a proxy or a
software representative of the actual device itself.  And I am not
knocking this designation, as I think it makes a lot of sense.

However, what I will point out is that what we are creating here in vbus
is more akin to the SATA disk itself, not the LDM "block device"
representation of the device.   There was no really great existing way
to express this type of object, which is why I had to create a new
namespace in sysfs.

To dig down into this a little further, the device and interface are
inextricably linked in a relationship very close to this "physical
device" concept.  Therefore the "driver" portion of LDM that you
referenced w.r.t. the device-class doesn't even enter the picture here
(that would actually be up in the guest or userspace. 
Discussed below).

As an example, consider an e1000 network card.  The PCI-ID and REV for
the e1000 card and the associated ABI are like its "interface".  Whether
it's a physical card plugged into a physical PCI slot or an
emulated e1000 inside qemu-kvm is like its device-instance.  In theory,
I can use either device-instance interchangeably with any driver
that understands the ABI associated with the e1000 PCI-ID
(assuming all the plumbing is there, etc).  It's the same
deal here.  Taking a little creative license here to use that example in
terms of vbus concepts, I would have a device-class type =
"physical-e1000-card", and another "qemu-e

Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> vbus (if I understand it right) is a whole package of things:
>>>
>>> - a way to enumerate, discover, and manage devices
>>> 
>>
>> Yes
>>  
>>> That part duplicates PCI
>>> 
>>
>> Yes, but the important thing to point out is it doesn't *replace*
>> PCI. It is simply an alternative.
>>   
>
> Does it offer substantial benefits over PCI?  If not, it's just extra
> code.

First of all, do you think I would spend time designing it if I didn't
think so? :)

Second of all, I want to use vbus for other things that do not speak PCI
natively (like userspace for instance...and if I am gleaning this
correctly, lguest doesnt either).

PCI sounds good at first, but I believe it's a false economy.  It was
designed, of course, to be a hardware solution, so it carries all this
baggage derived from hardware constraints that simply do not exist in a
pure software world and that have to be emulated.  Things like the fixed
length and centrally managed PCI-IDs, PIO config cycles, BARs,
pci-irq-routing, etc.  While emulation of PCI is invaluable for
executing unmodified guests, it's not strictly necessary from a
paravirtual software perspective...PV software is inherently already
aware of its context and can therefore use the most appropriate
mechanism from a broader selection of choices.

If we insist that PCI is the only interface we can support and we want
to do something, say, in the kernel for instance, we have to have either
something like the ICH model in the kernel (and really all of the pci
chipset models that qemu supports), or a hacky hybrid userspace/kernel
solution.  I think this is what you are advocating, but I'm sorry. IMO
that's just gross and unnecessary gunk.  Let's stop beating around the
bush and just define the 4-5 hypercall verbs we need and be done with
it.  :)

FYI: The guest support for this is not really *that* much code IMO.
 
 drivers/vbus/proxy/Makefile  |2
 drivers/vbus/proxy/kvm.c |  726 +

and plus, I'll gladly maintain it :)

I mean, its not like new buses do not get defined from time to time. 
Should the computing industry stop coming up with new bus types because
they are afraid that the windows ABI only speaks PCI?  No, they just
develop a new driver for whatever the bus is and be done with it.  This
is really no different.

>
> Note that virtio is not tied to PCI, so "vbus is generic" doesn't count.
Well, preserving the existing virtio-net on x86 ABI is tied to PCI,
which is what I was referring to.  Sorry for the confusion.

>
>>> and it would be pretty hard to convince me we need to move to
>>> something new
>>> 
>>
>> But that's just it.  You don't *need* to move.  The two can coexist side
>> by side peacefully.  "vbus" just ends up being another device that may
>> or may not be present, and that may or may not have devices on it.  In
>> fact, during all this testing I was booting my guest with "eth0" as
>> virtio-net, and "eth1" as venet.  They both worked totally fine and
>> harmoniously.  The guest simply discovers if vbus is supported via a
>> cpuid feature bit and dynamically adds it if present.
>>   
>
> I meant, move the development effort, testing, installed base, Windows
> drivers.

Again, I will maintain this feature, and it's completely off to the
side.  Turn it off in the config, or do not enable it in qemu, and it's
like it never existed.  Worst case is it gets reverted if you don't like
it.  Aside from the last few kvm specific patches, the rest is no
different than the greater linux environment.  E.g. if I update the
venet driver upstream, it's conceptually no different than someone else
updating e1000, right?

>
>>  
>>> .  virtio-pci (a) works,
>>> 
>> And it will continue to work
>>   
>
> So why add something new?

I was hoping this was becoming clear by now, but apparently I am doing a
poor job of articulating things. :(  I think we got bogged down in the
802.x performance discussion and lost sight of what we are trying to
accomplish with the core infrastructure.

So this core vbus infrastructure is for generic, in-kernel IO models. 
As a first pass, we have implemented a kvm-connector, which lets kvm
guest kernels have access to the bus.  We also have a userspace
connector (which I haven't pushed yet due to remaining issues being
ironed out) which allows userspace applications to interact with the
devices as well.  As a prototype, we built "venet" to show how it all works.

In the future, we want to use this infrastructure to build IO models for
various things like high performance fabrics and guest bypass
technologies, etc.  For instance, guest userspace connections to RDMA
devices in the kernel, etc.

>
>>  
>>> (b) works on Windows.
>>> 
>>
>> virtio will continue to work on windows, as well.  And if one of my
>> customers wants vbus support on windows and is willing to pay us to
>> develop it, we can support *it* there as well.
>>   
>
> I don't want to develop and support both v

[PATCH -tip 6/6 V4] tracing: kprobe-tracer plugin supports arguments

2009-04-02 Thread Masami Hiramatsu
Support following probe arguments and add fetch functions on kprobe-tracer.
  %REG  : Fetch register REG
  sN: Fetch Nth entry of stack (N >= 0)
  @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
  aN: Fetch function argument. (N >= 0)
  rv: Fetch return value.
  ra: Fetch return address.
  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.
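
To illustrate how a recursive argument such as +|-offs(FETCHARG) can be
evaluated, here is a minimal userspace sketch built around the same
function-pointer-plus-data shape as the patch's struct fetch_func. The
register frame (struct fake_regs), struct indirect_data and the handler
bodies are assumptions made for this example; they are not the patch's
implementation, and unlike kernel code the sketch does nothing to guard
against faulting addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct pt_regs; userspace only. */
struct fake_regs {
	unsigned long ax;
	unsigned long sp;
};

struct fetch_func {
	unsigned long (*func)(struct fake_regs *, void *);
	void *data;
};

static unsigned long call_fetch(struct fetch_func *f, struct fake_regs *regs)
{
	return f->func(regs, f->data);
}

/* %REG: data points at a byte offset into the register frame. */
static unsigned long fetch_register(struct fake_regs *regs, void *data)
{
	size_t offset = *(size_t *)data;
	unsigned long val;

	memcpy(&val, (const char *)regs + offset, sizeof(val));
	return val;
}

/* +|-offs(FETCHARG): evaluate the inner fetch, then read memory at +offs. */
struct indirect_data {
	struct fetch_func inner; /* the nested FETCHARG */
	long offset;             /* signed byte offset applied to its result */
};

static unsigned long fetch_indirect(struct fake_regs *regs, void *data)
{
	struct indirect_data *ind = data;
	unsigned long addr = call_fetch(&ind->inner, regs) + (unsigned long)ind->offset;
	unsigned long val;

	/* No fault handling here; this sketch assumes addr is readable. */
	memcpy(&val, (const void *)(uintptr_t)addr, sizeof(val));
	return val;
}
```

Because the inner field is itself a struct fetch_func, an indirect fetch can
wrap another indirect fetch, which is one way to read "recursive indirect
memory fetching".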

changes from v3.1:
 - remove arch-dep code.
 - aN start with a0 instead of a1.
 - include symbol value fetching support.
 - support name based register fetching. (and remove rN)
 - support recursive indirect memory fetching.

Signed-off-by: Masami Hiramatsu 
Cc: Steven Rostedt 
Cc: Ananth N Mavinakayanahalli 
Cc: Ingo Molnar 
Cc: Frederic Weisbecker 
---

 Documentation/ftrace.txt|   47 +++--
 kernel/trace/trace_kprobe.c |  431 +--
 2 files changed, 441 insertions(+), 37 deletions(-)


diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index fd0833b..c593780 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -1329,17 +1329,34 @@ current_tracer, instead of that, just set probe points via
 /debug/tracing/kprobe_probes.

 Synopsis of kprobe_probes:
-  p SYMBOL[+offs|-offs]|MEMADDR: set a probe
-  r SYMBOL[+0] : set a return probe
+  p SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS]: set a probe
+  r SYMBOL[+0] [FETCHARGS] : set a return probe
+
+ FETCHARGS:
+  %REG : Fetch register REG
+  sN   : Fetch Nth entry of stack (N >= 0)
+  @ADDR: Fetch memory at ADDR (ADDR should be in kernel)
+  @SYM[+|-offs]: Fetch memory at SYM +|- offs (SYM should be a data symbol)
+  aN   : Fetch function argument. (N >= 0)(*)
+  rv   : Fetch return value.(**)
+  ra   : Fetch return address.(**)
+  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.(***)
+
+  (*) aN may not correct on asmlinkaged functions and at the middle of
+  function body.
+  (**) only for return probe.
+  (***) this is useful for fetching a field of data structures.

 E.g.
-  echo p sys_open > /debug/tracing/kprobe_probes
+  echo p do_sys_open a0 a1 a2 a3 > /debug/tracing/kprobe_probes

- This sets a kprobe on the top of sys_open() function.
+ This sets a kprobe on the top of do_sys_open() function with recording
+1st to 4th arguments.

-  echo r sys_open >> /debug/tracing/kprobe_probes
+  echo r do_sys_open rv ra >> /debug/tracing/kprobe_probes

- This sets a kretprobe on the return point of sys_open() function.
+ This sets a kretprobe on the return point of do_sys_open() function with
+recording return value and return address.

   echo > /debug/tracing/kprobe_probes

@@ -1351,18 +1368,16 @@ E.g.
 #
 #   TASK-PIDCPU#TIMESTAMP  FUNCTION
 #  | |   |  | |
-   <...>-5117  [003]   416.481638: sys_open: @sys_open+0
-   <...>-5117  [003]   416.481662: syscall_call: <-sys_open+0
-   <...>-5117  [003]   416.481739: sys_open: @sys_open+0
-   <...>-5117  [003]   416.481762: sysenter_do_call: <-sys_open+0
-   <...>-5117  [003]   416.481818: sys_open: @sys_open+0
-   <...>-5117  [003]   416.481842: sysenter_do_call: <-sys_open+0
-   <...>-5117  [003]   416.481882: sys_open: @sys_open+0
-   <...>-5117  [003]   416.481905: sysenter_do_call: <-sys_open+0
+   <...>-2376  [001]   262.389131: do_sys_open: @do_sys_open+0 0xff9c 0x98db83e 0x8880 0x0
+   <...>-2376  [001]   262.391166: sys_open: <-do_sys_open+0 0x5 0xc06e8ebb
+   <...>-2376  [001]   264.384876: do_sys_open: @do_sys_open+0 0xff9c 0x98db83e 0x8880 0x0
+   <...>-2376  [001]   264.386880: sys_open: <-do_sys_open+0 0x5 0xc06e8ebb
+   <...>-2084  [001]   265.380330: do_sys_open: @do_sys_open+0 0xff9c 0x804be3e 0x0 0x1b6
+   <...>-2084  [001]   265.380399: sys_open: <-do_sys_open+0 0x3 0xc06e8ebb

  @SYMBOL means that kernel hits a probe, and <-SYMBOL means kernel returns
-from SYMBOL(e.g. "sysenter_do_call: <-sys_open+0" means kernel returns from
-sys_open to sysenter_do_call).
+from SYMBOL(e.g. "sys_open: <-do_sys_open+0" means kernel returns from
+do_sys_open to sys_open).


 function graph tracer
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 8263b5f..df71c6c 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -27,10 +27,134 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include "trace.h"

+/* currently, trace_kprobe only supports X86. */
+
+struct fetch_func {
+   unsigned long (*func)(struct pt_regs *, void *);
+   void *data;
+};
+
+static unsigned long call_fetch(struct fetch_func *f, struct pt_regs *regs)
+{
+   return f->func(regs, f->data);
+}
+
+/* fetch handlers */
+static unsigned long fetch_register(struct pt_regs *regs, void *offset)
+{
+   

[PATCH -tip 5/6 V4] tracing: kprobe-tracer plugin core

2009-04-02 Thread Masami Hiramatsu
Add kprobes based event tracer on ftrace.

This tracer is similar to the events tracer which is based on Tracepoint
infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe
and kretprobe). It probes anywhere where kprobes can probe(this means, all
functions body except for __kprobes functions).

changes from v3:
  - warn if the probe address is not an instruction boundary.

Signed-off-by: Masami Hiramatsu 
Cc: Steven Rostedt 
Cc: Ananth N Mavinakayanahalli 
Cc: Ingo Molnar 
Cc: Frederic Weisbecker 
---

 Documentation/ftrace.txt|   55 ++
 kernel/trace/Kconfig|9 +
 kernel/trace/Makefile   |1
 kernel/trace/trace_kprobe.c |  400 +++
 4 files changed, 465 insertions(+), 0 deletions(-)
 create mode 100644 kernel/trace/trace_kprobe.c


diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index fd9a3e6..fd0833b 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -1310,6 +1310,61 @@ dereference in a kernel module:
 [...]


+kprobe-based event tracer
+---
+
+This tracer is similar to the events tracer which is based on Tracepoint
+infrastructure. Instead of Tracepoint, this tracer is based on kprobes(kprobe
+and kretprobe). It probes anywhere where kprobes can probe(this means, all
+functions body except for __kprobes functions).
+
+Unlike the function tracer, this tracer can probe instructions inside of
+kernel functions. It allows you to check which instruction has been executed.
+
+Unlike the Tracepoint based events tracer, this tracer can add new probe points
+on the fly.
+
+Similar to the events tracer, this tracer doesn't need to be activated via
+current_tracer, instead of that, just set probe points via
+/debug/tracing/kprobe_probes.
+
+Synopsis of kprobe_probes:
+  p SYMBOL[+offs|-offs]|MEMADDR: set a probe
+  r SYMBOL[+0] : set a return probe
+
+E.g.
+  echo p sys_open > /debug/tracing/kprobe_probes
+
+ This sets a kprobe on the top of sys_open() function.
+
+  echo r sys_open >> /debug/tracing/kprobe_probes
+
+ This sets a kretprobe on the return point of sys_open() function.
+
+  echo > /debug/tracing/kprobe_probes
+
+ This clears all probe points. and you can see the traced information via
+/debug/tracing/trace.
+
+  cat /debug/tracing/trace
+# tracer: nop
+#
+#   TASK-PIDCPU#TIMESTAMP  FUNCTION
+#  | |   |  | |
+   <...>-5117  [003]   416.481638: sys_open: @sys_open+0
+   <...>-5117  [003]   416.481662: syscall_call: <-sys_open+0
+   <...>-5117  [003]   416.481739: sys_open: @sys_open+0
+   <...>-5117  [003]   416.481762: sysenter_do_call: <-sys_open+0
+   <...>-5117  [003]   416.481818: sys_open: @sys_open+0
+   <...>-5117  [003]   416.481842: sysenter_do_call: <-sys_open+0
+   <...>-5117  [003]   416.481882: sys_open: @sys_open+0
+   <...>-5117  [003]   416.481905: sysenter_do_call: <-sys_open+0
+
+ @SYMBOL means that kernel hits a probe, and <-SYMBOL means kernel returns
+from SYMBOL(e.g. "sysenter_do_call: <-sys_open+0" means kernel returns from
+sys_open to sysenter_do_call).
+
+
 function graph tracer
 ---

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8a4d729..becd8ed 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -357,6 +357,15 @@ config BLK_DEV_IO_TRACE

  If unsure, say N.

+config KPROBE_TRACER
+   depends on KPROBES
+   depends on X86
+   bool "Trace kprobes"
+   select TRACING
+   help
+ This tracer probes everywhere where kprobes can probe it, and
+ records various registers and memories specified by user.
+
 config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 2630f51..f39a26b 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -46,5 +46,6 @@ obj-$(CONFIG_EVENT_TRACER) += trace_export.o
 obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
 obj-$(CONFIG_EVENT_PROFILE) += trace_event_profile.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events_filter.o
+obj-$(CONFIG_KPROBE_TRACER) += trace_kprobe.o

 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
new file mode 100644
index 000..8263b5f
--- /dev/null
+++ b/kernel/trace/trace_kprobe.c
@@ -0,0 +1,400 @@
+/*
+ * kprobe based kernel tracer
+ *
+ * Created by Masami Hiramatsu 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU G

[PATCH -tip 4/6 V4] x86: kprobes checks safeness of insertion address.

2009-04-02 Thread Masami Hiramatsu
Ensure that inserting a kprobe is safe by checking whether the specified
address is at the first byte of an instruction. This is done by decoding
the probed function from its head to the probe point.
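
The boundary check can be pictured with the following minimal sketch. It
walks a byte buffer from the start of a function, adding each instruction's
length, and accepts the target offset only if the walk lands exactly on it.
The length decoder here is a made-up toy encoding, not x86, and the names
toy_insn_length() and at_insn_boundary() are hypothetical; the real patch
uses the x86 instruction decoder for the length.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical toy encoding, NOT x86: the low two bits of the first byte
 * give the instruction length minus one, so lengths range from 1 to 4.
 */
static size_t toy_insn_length(const uint8_t *p)
{
	return (size_t)(p[0] & 0x3) + 1;
}

/*
 * Decode from the head of the function up to the target offset.
 * The target is an instruction boundary only if the walk stops exactly
 * on it; overshooting means the target is in the middle of an instruction.
 */
static bool at_insn_boundary(const uint8_t *func, size_t func_len, size_t target)
{
	size_t off = 0;

	if (target >= func_len)
		return false;
	while (off < target)
		off += toy_insn_length(func + off);
	return off == target;
}
```

Starting from the function head matters on a variable-length instruction
set: decoding from an arbitrary address could misread operand bytes as an
opcode and report a false boundary.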

Signed-off-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Jim Keniston 
Cc: Ingo Molnar 
---

 arch/x86/kernel/kprobes.c |   51 +
 1 files changed, 51 insertions(+), 0 deletions(-)


diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 7b5169d..39c79cc 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -48,12 +48,14 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
 #include 
 #include 
 #include 
+#include 

 void jprobe_return_end(void);

@@ -244,6 +246,53 @@ retry:
}
 }

+/* Recover original instruction */
+static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
+{
+   struct kprobe *kp;
+   kp = get_kprobe((void *)addr);
+   if (!kp)
+   return -EINVAL;
+
+   /* Don't use p->ainsn.insn; which will be modified by fix_riprel */
+   memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+   buf[0] = kp->opcode;
+   return 0;
+}
+
+/* Dummy buffers for lookup_symbol_attrs */
+static char __dummy_buf[KSYM_NAME_LEN];
+
+/* Check whether the address can be probed */
+static int __kprobes can_probe(unsigned long paddr)
+{
+   int ret;
+   unsigned long addr, offset = 0;
+   struct insn insn;
+   kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+   /* Lookup symbol including addr */
+   if (!kallsyms_lookup(paddr, NULL, &offset, NULL, __dummy_buf))
+   return 0;
+
+   /* Decode instructions */
+   addr = paddr - offset;
+   while (addr < paddr) {
+   insn_init_kernel(&insn, (void *)addr);
+   insn_get_opcode(&insn);
+   if (OPCODE1(&insn) == BREAKPOINT_INSTRUCTION) {
+   ret = recover_probed_instruction(buf, addr);
+   if (ret)
+   return 0;
+   insn_init_kernel(&insn, buf);
+   }
+   insn_get_length(&insn);
+   addr += insn.length;
+   }
+
+   return (addr == paddr);
+}
+
 /*
  * Returns non-zero if opcode modifies the interrupt flag.
  */
@@ -359,6 +408,8 @@ static void __kprobes arch_copy_kprobe(struct kprobe *p)

 int __kprobes arch_prepare_kprobe(struct kprobe *p)
 {
+   if (!can_probe((unsigned long)p->addr))
+   return -EILSEQ;
/* insn: must be on special executable page on x86. */
p->ainsn.insn = get_insn_slot();
if (!p->ainsn.insn)
-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhira...@redhat.com



[PATCH -tip 3/6 V4] x86: instruction decorder API

2009-04-02 Thread Masami Hiramatsu
Add x86 instruction decoder to arch-specific libraries. This decoder
can decode all x86 instructions into prefix, opcode, modrm, sib,
displacement and immediates. This can also show the length of
instructions.
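
As a small worked example of one decoded field, the sketch below splits a
ModRM byte into its mod, reg and rm parts using the same masks and shifts
as the MODRM_MOD/MODRM_REG/MODRM_RM macros in insn.h. The struct
modrm_fields and split_modrm() names are hypothetical helpers for this
illustration and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical holder for the three ModRM sub-fields. */
struct modrm_fields {
	uint8_t mod; /* bits 7:6, addressing mode */
	uint8_t reg; /* bits 5:3, register or opcode extension */
	uint8_t rm;  /* bits 2:0, register or memory operand */
};

static struct modrm_fields split_modrm(uint8_t modrm)
{
	struct modrm_fields f = {
		.mod = (uint8_t)((modrm & 0xc0) >> 6),
		.reg = (uint8_t)((modrm & 0x38) >> 3),
		.rm  = (uint8_t)(modrm & 0x07),
	};

	return f;
}
```

For instance, the byte 0xd8 (binary 11 011 000) splits into mod=3, reg=3,
rm=0.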

Signed-off-by: Jim Keniston 
Signed-off-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Ingo Molnar 
Cc: Andi Kleen 
Cc: kvm@vger.kernel.org
---

 arch/x86/include/asm/insn.h |  130 +
 arch/x86/lib/Makefile   |1
 arch/x86/lib/insn.c |  627 +++
 3 files changed, 758 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/insn.h
 create mode 100644 arch/x86/lib/insn.c


diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
new file mode 100644
index 000..488001f
--- /dev/null
+++ b/arch/x86/include/asm/insn.h
@@ -0,0 +1,130 @@
+#ifndef _ASM_X86_INSN_H
+#define _ASM_X86_INSN_H
+/*
+ * x86 instruction analysis
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ */
+
+#include 
+
+/* legacy instruction prefixes */
+#define X86_PFX_OPNDSZ 0x1 /* 0x66 */
+#define X86_PFX_ADDRSZ 0x2 /* 0x67 */
+#define X86_PFX_CS 0x4 /* 0x2E */
+#define X86_PFX_DS 0x8 /* 0x3E */
+#define X86_PFX_ES 0x10/* 0x26 */
+#define X86_PFX_FS 0x20/* 0x64 */
+#define X86_PFX_GS 0x40/* 0x65 */
+#define X86_PFX_SS 0x80/* 0x36 */
+#define X86_PFX_LOCK   0x100   /* 0xF0 */
+#define X86_PFX_REPE   0x200   /* 0xF3 */
+#define X86_PFX_REPNE  0x400   /* 0xF2 */
+/* REX prefix */
+#define X86_PFX_REX0x800   /* 0x4X */
+/* REX prefix dissected */
+#define X86_PFX_REX_BASE 0x1000
+#define X86_PFX_REXB   0x1000  /* 0x41 bit */
+#define X86_PFX_REXX   0x2000  /* 0x42 bit */
+#define X86_PFX_REXR   0x4000  /* 0x44 bit */
+#define X86_PFX_REXW   0x8000  /* 0x48 bit */
+
+struct insn_field {
+   union {
+   s32 value;
+   u8 bytes[4];
+   };
+   bool got;   /* true if we've run insn_get_xxx() for this field */
+   u8 nbytes;
+};
+
+struct insn {
+   struct insn_field prefixes; /* prefixes.value is a bitmap */
+   struct insn_field opcode;   /*
+* opcode.bytes[0]: opcode1
+* opcode.bytes[1]: opcode2
+* opcode.bytes[2]: opcode3
+*/
+   struct insn_field modrm;
+   struct insn_field sib;
+   struct insn_field displacement;
+   union {
+   struct insn_field immediate;
+   struct insn_field moffset1; /* for 64bit MOV */
+   struct insn_field immediate1;   /* for 64bit imm or off16/32 */
+   };
+   union {
+   struct insn_field moffset2; /* for 64bit MOV */
+   struct insn_field immediate2;   /* for 64bit imm or seg16 */
+   };
+
+   u8 opnd_bytes;
+   u8 addr_bytes;
+   u8 length;
+   bool x86_64;
+
+   const u8 *kaddr;/* kernel address of insn (copy) to analyze */
+   const u8 *next_byte;
+};
+
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+
+#define MODRM_MOD(insn) (((insn)->modrm.value & 0xc0) >> 6)
+#define MODRM_REG(insn) (((insn)->modrm.value & 0x38) >> 3)
+#define MODRM_RM(insn) ((insn)->modrm.value & 0x07)
+
+#define SIB_SCALE(insn) (((insn)->sib.value & 0xc0) >> 6)
+#define SIB_INDEX(insn) (((insn)->sib.value & 0x38) >> 3)
+#define SIB_BASE(insn) ((insn)->sib.value & 0x07)
+
+#define MOFFSET64(insn)(((u64)((insn)->moffset2.value) << 32) | \
+ (u32)((insn)->moffset1.value))
+
+#define IMMEDIATE64(insn)  (((u64)((insn)->immediate2.value) << 32) | \
+ (u32)((insn)->immediate1.value))
+
+extern void insn_init(struct insn *insn, const u8 *kaddr, bool x86_64);
+extern void insn_get_prefixes(struct insn *insn);
+extern void insn_get_opcode(struct insn *insn);
+extern void insn_get_modrm(struct insn *insn);
+extern void insn_get_sib(struct insn *insn);
+extern void insn_get_displacement(struct insn *insn);
+extern void insn_get_immediate(struct insn *insn);
+extern void

[PATCH -tip 2/6 V4] x86: add arch-dep register and stack access API to ptrace

2009-04-02 Thread Masami Hiramatsu
Add the following APIs for accessing registers and stack entries from pt_regs.
- query_register_offset(const char *name)
   Query the offset of "name" register.

- query_register_name(unsigned offset)
   Query the name of register by its offset.

- get_register(struct pt_regs *regs, unsigned offset)
   Get the value of a register by its offset.

- valid_stack_address(struct pt_regs *regs, unsigned long addr)
   Check whether the address is in the stack.

- get_stack_nth(struct pt_regs *regs, unsigned nth)
   Get the Nth entry of the stack. (N >= 0)

- get_argument_nth(struct pt_regs *regs, unsigned nth)
   Get the Nth argument at function call. (N >= 0)
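
For illustration only, the name/offset lookups listed above can be sketched in
userspace as follows. This is not the code from the patch: struct mock_regs is
a hypothetical stand-in for struct pt_regs, every mock_* name is made up for
the example, and the -1 error value is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct pt_regs; the real layout is arch-specific. */
struct mock_regs {
	unsigned long bx, cx, dx, si, di, bp, ax, ip, flags, sp;
};

struct mock_reg_offset {
	const char *name;
	int offset;
};

#define MOCK_REG(r) { #r, (int)offsetof(struct mock_regs, r) }
#define MOCK_MAX_REG_OFFSET offsetof(struct mock_regs, sp)

/* NULL-terminated table mapping register names to byte offsets. */
static const struct mock_reg_offset mock_table[] = {
	MOCK_REG(bx), MOCK_REG(cx), MOCK_REG(dx), MOCK_REG(si), MOCK_REG(di),
	MOCK_REG(bp), MOCK_REG(ax), MOCK_REG(ip), MOCK_REG(flags), MOCK_REG(sp),
	{ NULL, 0 },
};

/* Name -> byte offset, or -1 (assumed error value) if the name is unknown. */
int mock_query_register_offset(const char *name)
{
	const struct mock_reg_offset *r;

	for (r = mock_table; r->name; r++)
		if (strcmp(r->name, name) == 0)
			return r->offset;
	return -1;
}

/* Byte offset -> name, or NULL if no register starts at that offset. */
const char *mock_query_register_name(unsigned offset)
{
	const struct mock_reg_offset *r;

	for (r = mock_table; r->name; r++)
		if ((unsigned)r->offset == offset)
			return r->name;
	return NULL;
}

/* Byte offset -> value; offsets past the last register read as 0. */
unsigned long mock_get_register(const struct mock_regs *regs, unsigned offset)
{
	if (offset > MOCK_MAX_REG_OFFSET)
		return 0;
	return *(const unsigned long *)((const char *)regs + offset);
}
```

A caller would typically resolve the name once with
mock_query_register_offset("ax") and then reuse the returned offset with
mock_get_register() on every hit, which keeps string comparison out of the
hot path.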

Signed-off-by: Masami Hiramatsu 
Cc: Steven Rostedt 
Cc: Ananth N Mavinakayanahalli 
Cc: Ingo Molnar 
Cc: Frederic Weisbecker 
---

 arch/x86/include/asm/ptrace.h |   66 +
 arch/x86/kernel/ptrace.c  |   59 +
 2 files changed, 125 insertions(+), 0 deletions(-)


diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index aed0894..44773b8 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -7,6 +7,7 @@

 #ifdef __KERNEL__
 #include 
+#include 
 #endif

 #ifndef __ASSEMBLY__
@@ -215,6 +216,71 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs)
return regs->sp;
 }

+/* Query offset/name of register from its name/offset */
+extern int query_register_offset(const char *name);
+extern const char *query_register_name(unsigned offset);
+#define MAX_REG_OFFSET (offsetof(struct pt_regs, sp))
+
+/* Get register value from its offset */
+static inline unsigned long get_register(struct pt_regs *regs, unsigned offset)
+{
+   if (unlikely(offset > MAX_REG_OFFSET))
+   return 0;
+   return *(unsigned long *)((unsigned long)regs + offset);
+}
+
+/* Check the address in the stack */
+static inline int valid_stack_address(struct pt_regs *regs, unsigned long addr)
+{
+   return ((addr & ~(THREAD_SIZE - 1))  ==
+   (kernel_trap_sp(regs) & ~(THREAD_SIZE - 1)));
+}
+
+/* Get Nth entry of the stack */
+static inline unsigned long get_stack_nth(struct pt_regs *regs, unsigned n)
+{
+   unsigned long *addr = (unsigned long *)kernel_trap_sp(regs);
+   addr += n;
+   if (valid_stack_address(regs, (unsigned long)addr))
+   return *addr;
+   else
+   return 0;
+}
+
+/* Get Nth argument at function call */
+static inline unsigned long get_argument_nth(struct pt_regs *regs, unsigned n)
+{
+#ifdef CONFIG_X86_32
+#define NR_REGPARMS 3
+   if (n < NR_REGPARMS) {
+   switch (n) {
+   case 0: return regs->ax;
+   case 1: return regs->dx;
+   case 2: return regs->cx;
+   }
+   return 0;
+#else /* CONFIG_X86_64 */
+#define NR_REGPARMS 6
+   if (n < NR_REGPARMS) {
+   switch (n) {
+   case 0: return regs->di;
+   case 1: return regs->si;
+   case 2: return regs->dx;
+   case 3: return regs->cx;
+   case 4: return regs->r8;
+   case 5: return regs->r9;
+   }
+   return 0;
+#endif
+   } else {
+   /*
+* The typical case: arg n is on the stack.
+* (Note: stack[0] = return address, so skip it)
+*/
+   return get_stack_nth(regs, 1 + n - NR_REGPARMS);
+   }
+}
+
 /*
  * These are defined as per linux/ptrace.h, which see.
  */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5c6e463..8d65dcb 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -46,6 +46,65 @@ enum x86_regset {
REGSET_IOPERM32,
 };

+struct pt_regs_offset {
+   const char *name;
+   int offset;
+};
+
+#define REG_OFFSET(r) offsetof(struct pt_regs, r)
+#define REG_OFFSET_NAME(r) {.name = #r, .offset = REG_OFFSET(r)}
+#define REG_OFFSET_END {.name = NULL, .offset = 0}
+
+static struct pt_regs_offset regoffset_table[] = {
+#ifdef CONFIG_X86_64
+   REG_OFFSET_NAME(r15),
+   REG_OFFSET_NAME(r14),
+   REG_OFFSET_NAME(r13),
+   REG_OFFSET_NAME(r12),
+   REG_OFFSET_NAME(r11),
+   REG_OFFSET_NAME(r10),
+   REG_OFFSET_NAME(r9),
+   REG_OFFSET_NAME(r8),
+#endif
+   REG_OFFSET_NAME(bx),
+   REG_OFFSET_NAME(cx),
+   REG_OFFSET_NAME(dx),
+   REG_OFFSET_NAME(si),
+   REG_OFFSET_NAME(di),
+   REG_OFFSET_NAME(bp),
+   REG_OFFSET_NAME(ax),
+#ifdef CONFIG_X86_32
+   REG_OFFSET_NAME(ds),
+   REG_OFFSET_NAME(es),
+   REG_OFFSET_NAME(fs),
+   REG_OFFSET_NAME(gs),
+#endif
+   REG_OFFSET_NAME(orig_ax),
+   REG_OFFSET_NAME(ip),
+   REG_OFFSET_NAME(cs),
+   REG_OFFSET_NAME(flags),
+   REG_OFFSET_NAME(sp),
+   REG_OFFSET_END,
+};
+
+int query_register_offset(const char *name)
+{
+   struct pt_regs_offset *roff = regoffset_table;
+

[PATCH -tip 1/6 V4] x86: fix kernel_trap_sp()

2009-04-02 Thread Masami Hiramatsu
Use &regs->sp instead of regs for getting the top of stack in kernel mode.
(on x86-64, regs->sp always points to the top of the stack)
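
As a rough userspace illustration of why the address of the sp field is the
right value on 32-bit: on a kernel-mode trap the CPU does not push sp and ss,
so the saved frame ends where the sp field would begin. The struct below is a
hypothetical, shortened stand-in for the 32-bit pt_regs, and the mock_* names
are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened stand-in for the 32-bit struct pt_regs. */
struct mock_regs32 {
	unsigned long ax, ip, cs, flags; /* words actually saved on the stack */
	unsigned long sp;                /* not saved on a kernel-mode trap */
	unsigned long ss;                /* not saved either */
};

/* Old behaviour: the start of the saved frame. */
unsigned long mock_trap_sp_old(struct mock_regs32 *regs)
{
	return (unsigned long)regs;
}

/*
 * Fixed behaviour: the first address past the words that were saved,
 * which is where the interrupted code's stack top was.
 */
unsigned long mock_trap_sp_new(struct mock_regs32 *regs)
{
	return (unsigned long)&regs->sp;
}
```

The two results differ by exactly offsetof(struct mock_regs32, sp), that is,
by the size of the saved part of the frame.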

Signed-off-by: Masami Hiramatsu 
Cc: Harvey Harrison 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Jan Blunck 
---

 arch/x86/include/asm/ptrace.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index e304b66..aed0894 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -187,14 +187,14 @@ static inline int v8086_mode(struct pt_regs *regs)

 /*
  * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
- * when it traps.  So regs will be the current sp.
+ * when it traps.  So &regs->sp will be the current sp.
  *
  * This is valid only for kernel mode traps.
  */
 static inline unsigned long kernel_trap_sp(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_32
-   return (unsigned long)regs;
+   return (unsigned long)&regs->sp;
 #else
return regs->sp;
 #endif
-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhira...@redhat.com



Re: virtio_net: MAC address releated breakage if there is no MAC area in config

2009-04-02 Thread Christian Borntraeger
Am Thursday 02 April 2009 18:06:25 schrieb Alex Williamson:
> On Thu, 2009-04-02 at 13:33 +0200, Christian Borntraeger wrote:
> > I read this as the mac config field is optional (similar to all the optional
> > fields we added in virtio_blk later).
[...]
> Sorry for the breakage.  My interpretation of the virtio-net config
> space was that the mac field was always present and the host had
> programmed a valid value when the F_MAC feature is available.  However,
> from the history of the flag, it seems like you're interpretation is
> likely correct.  Setting the config value from the randomly generated
> mac was largely opportunistic since there's no userspace that doesn't
> provide a mac by default.  So perhaps we can drop that and gate the
> set_mac_address entry point as shown below.  How does this look?
> Thanks,

That patch would solve my problem. Thanks.

In addition, I will change our hypervisor sample code, to provide the
config space even if we do not set a MAC address in the host. Better
safe than sorry.

[...]
> virtio_net: Set the mac config only when VIRTIO_NET_F_MAC
> 
> VIRTIO_NET_F_MAC indicates the presence of the mac field in config
> space, not the validity of the value it contains.  Allow the mac to be
> changed at runtime, but only push the change into config space with the
> VIRTIO_NET_F_MAC feature present.
> 
> Signed-off-by: Alex Williamson 

Acked-by: Christian Borntraeger 

> --
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a6f1e19..9c82a39 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -575,8 +575,9 @@ static int virtnet_set_mac_address(struct net_device 
> *dev, void *p)
>   if (ret)
>   return ret;
> 
> - vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
> -   dev->dev_addr, dev->addr_len);
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
> + vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
> +   dev->dev_addr, dev->addr_len);
> 
>   return 0;
>  }
> @@ -876,11 +877,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>   vdev->config->get(vdev,
> offsetof(struct virtio_net_config, mac),
> dev->dev_addr, dev->addr_len);
> - } else {
> + } else
>   random_ether_addr(dev->dev_addr);
> - vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
> -   dev->dev_addr, dev->addr_len);
> - }
> 
>   /* Set up our device-specific information */
>   vi = netdev_priv(dev);



[PATCH -tip 0/6 V4] tracing: kprobe-based event tracer

2009-04-02 Thread Masami Hiramatsu
Hi,

Here are the patches of kprobe-based event tracer for x86, version 4.

This version supports only x86 (32-bit and 64-bit). Anyone interested in
porting it to another architecture just needs to port
kprobes/kretprobes and the ptrace enhancement [PATCH 2/6].

I added an x86 instruction decoder in this version. It might be better to
integrate it with KVM's decoder, and the kprobes x86 code should be
rewritten to use it.


This can be applied on the linux-2.6-tip tree.

This patchset includes following changes:
- Fix kernel_trap_sp() on x86 according to systemtap runtime. [1/6]
- Add arch-dep register and stack fetching functions [2/6]
- Add x86 instruction decoder [3/6]
- Check insertion point safety in kprobe [4/6]
- Add kprobe-tracer plugin [5/6]
- Support fetching various status (register/stack/memory/etc.) [6/6]

Done items:
- Add kernel_trap_sp() and fetch_*() on other archs.
- Support name-based register fetching (ax, bx, and so on)
- Support indirect memory fetch from registers etc.
- Check insertion point safety by using instruction decoder.

Future items:
- .init function tracing support.
- Support primitive types(long, ulong, int, uint, etc) for args.


kprobe-based event tracer
---

This tracer is similar to the events tracer, which is based on the Tracepoint
infrastructure. Instead of Tracepoint, this tracer is based on kprobes (kprobe
and kretprobe). It can probe anywhere kprobes can probe (that is, any function
body except those of __kprobes functions).

Unlike the function tracer, this tracer can probe instructions inside
kernel functions. It allows you to check which instruction has been executed.

Unlike the Tracepoint-based events tracer, this tracer can add new probe points
on the fly.

Like the events tracer, this tracer doesn't need to be activated via
current_tracer; instead, just set probe points via
/debug/tracing/kprobe_probes.

Synopsis of kprobe_probes:
  p SYMBOL[+offs|-offs]|MEMADDR [FETCHARGS] : set a probe
  r SYMBOL[+0] [FETCHARGS]  : set a return probe

 FETCHARGS:
  %REG  : Fetch register REG
  sN: Fetch Nth entry of stack (N >= 0)
  @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
  aN: Fetch function argument. (N >= 0)(*)
  rv: Fetch return value.(**)
  ra: Fetch return address.(**)
  +|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.(***)

  (*) aN may not be correct for asmlinkage functions or in the middle of a
  function body.
  (**) only for return probe.
  (***) this is useful for fetching a field of data structures.
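
To make the FETCHARGS grammar above concrete, here is a minimal userspace
sketch that classifies a single FETCHARG token according to the synopsis. It is
not the parser from the patch; the enum and the function name
classify_fetcharg() are hypothetical, and the checks are deliberately shallow
(for example, register names and symbols are not validated).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical classification of one FETCHARG token. */
enum fetch_kind {
	FETCH_INVALID,
	FETCH_REG,      /* %REG */
	FETCH_STACK,    /* sN */
	FETCH_MEM,      /* @ADDR or @SYM[+|-offs] */
	FETCH_ARG,      /* aN */
	FETCH_RETVAL,   /* rv */
	FETCH_RETADDR,  /* ra */
	FETCH_INDIRECT, /* +|-offs(FETCHARG) */
};

/* True if s is a non-empty string of decimal digits. */
static int all_digits(const char *s)
{
	if (!*s)
		return 0;
	for (; *s; s++)
		if (!isdigit((unsigned char)*s))
			return 0;
	return 1;
}

enum fetch_kind classify_fetcharg(const char *arg)
{
	size_t len;

	if (!arg || !*arg)
		return FETCH_INVALID;
	len = strlen(arg);

	switch (arg[0]) {
	case '%':
		return len > 1 ? FETCH_REG : FETCH_INVALID;
	case '@':
		return len > 1 ? FETCH_MEM : FETCH_INVALID;
	case '+':
	case '-':
		/* Needs "(...)" wrapping an inner FETCHARG. */
		return (strchr(arg, '(') && arg[len - 1] == ')') ?
			FETCH_INDIRECT : FETCH_INVALID;
	case 's':
		return all_digits(arg + 1) ? FETCH_STACK : FETCH_INVALID;
	case 'a':
		return all_digits(arg + 1) ? FETCH_ARG : FETCH_INVALID;
	case 'r':
		if (strcmp(arg, "rv") == 0)
			return FETCH_RETVAL;
		if (strcmp(arg, "ra") == 0)
			return FETCH_RETADDR;
		return FETCH_INVALID;
	default:
		return FETCH_INVALID;
	}
}
```

A real parser would also have to recurse into the inner FETCHARG of the
indirect form and reject rv/ra on non-return probes; both are left out here.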

E.g.
  echo p do_sys_open a0 a1 a2 a3 > /debug/tracing/kprobe_probes

 This sets a kprobe at the top of the do_sys_open() function and records its
1st to 4th arguments.

  echo r do_sys_open rv ra >> /debug/tracing/kprobe_probes

 This sets a kretprobe on the return point of the do_sys_open() function and
records the return value and return address.

  echo > /debug/tracing/kprobe_probes

 This clears all probe points. You can see the traced information via
/debug/tracing/trace.

  cat /debug/tracing/trace
# tracer: nop
#
#   TASK-PIDCPU#TIMESTAMP  FUNCTION
#  | |   |  | |
   <...>-2376  [001]   262.389131: do_sys_open: @do_sys_open+0 0xff9c 0x98db83e 0x8880 0x0
   <...>-2376  [001]   262.391166: sys_open: <-do_sys_open+0 0x5 0xc06e8ebb
   <...>-2376  [001]   264.384876: do_sys_open: @do_sys_open+0 0xff9c 0x98db83e 0x8880 0x0
   <...>-2376  [001]   264.386880: sys_open: <-do_sys_open+0 0x5 0xc06e8ebb
   <...>-2084  [001]   265.380330: do_sys_open: @do_sys_open+0 0xff9c 0x804be3e 0x0 0x1b6
   <...>-2084  [001]   265.380399: sys_open: <-do_sys_open+0 0x3 0xc06e8ebb

 @SYMBOL means that the kernel hit a probe, and <-SYMBOL means that the kernel
returned from SYMBOL (e.g. "sys_open: <-do_sys_open+0" means the kernel
returned from do_sys_open to sys_open).


 Documentation/ftrace.txt  |   70 
 arch/x86/include/asm/insn.h   |  130 +++
 arch/x86/include/asm/ptrace.h |   70 -
 arch/x86/kernel/kprobes.c |   51 +++
 arch/x86/kernel/ptrace.c  |   59 +++
 arch/x86/lib/Makefile |1 +
 arch/x86/lib/insn.c   |  627 
 kernel/trace/Kconfig  |9 +
 kernel/trace/Makefile |1 +
 kernel/trace/trace_kprobe.c   |  789 +
 10 files changed, 1805 insertions(+), 2 deletions(-)

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhira...@redhat.com





Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Fri, Apr 03, 2009 at 01:06:10AM +0800, Herbert Xu wrote:
>
> That only happens if the guest immediately does some CPU-intensive
> computation for 3ms, and assuming its timeslice lasts that long.
> 
> In any case, the same thing will happen right now if the host or
> some other guest on the same CPU hogs the CPU for 3ms.

Even better, look at the packet's TOS.  If it's marked for low-
latency then vmexit immediately.  Otherwise continue.

In the backend you'd just set the marker in shared memory.

Of course invert this for the host => guest direction.
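
A minimal sketch of the guest-side decision, for illustration only: the
function and enum names are hypothetical, and the only fixed fact used is that
the "minimize delay" bit of the IPv4 TOS byte is 0x10 (the value of
IPTOS_LOWDELAY).

```c
#include <assert.h>
#include <stdint.h>

/* "Minimize delay" bit of the IPv4 TOS byte (same value as IPTOS_LOWDELAY). */
#define SKETCH_TOS_LOWDELAY 0x10

/* Hypothetical outcome of the per-packet TX decision in the guest. */
enum tx_action {
	TX_KICK_NOW, /* vmexit immediately so the backend sees the packet */
	TX_DEFER,    /* keep going; the backend picks it up later */
};

/* Decide from the TOS byte alone whether this packet warrants a vmexit. */
enum tx_action guest_tx_policy(uint8_t tos)
{
	return (tos & SKETCH_TOS_LOWDELAY) ? TX_KICK_NOW : TX_DEFER;
}
```

With this policy bulk traffic never forces an exit by itself, while anything
an application has marked as latency-sensitive does.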

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote:
>
> 3ms latency for ping?
>
> (ping will always be scheduled immediately when the reply arrives if I  
> understand cfs, so guest load won't delay it)

That only happens if the guest immediately does some CPU-intensive
computation for 3ms, and assuming its timeslice lasts that long.

In any case, the same thing will happen right now if the host or
some other guest on the same CPU hogs the CPU for 3ms.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: PCI passtthrought & intel 82574L can't boot from disk

2009-04-02 Thread Brian Jackson
It is my understanding that you need vt-d/iommu support. I didn't think any 
existing amd chipsets had iommu support. You may want to look into that.

--Brian Jackson


On Thursday 02 April 2009 07:00:07 Hauke Hoffmann wrote:
> Hi,
>
> qemu-system-x86_64 runs well and i can boot and run the guest system. Thats
> works very well.
>
> Command:
> /usr/local/kvm/bin/qemu-system-x86_64 -m
> 512 -hda /var/VM/roadrunner.local/hda.qcow2 -smp 1 -vnc
> 192.168.2.30: -net nic,macaddr=DE:AD:BE:EF:90:26 -net
> tap,ifname=tap0,script=no,downscript=no -boot c
>
> Then i tried to add an intel 82574L network adapter to the guest.
> Just the same command with addtionally "-pcidevice host=07:00.0"
>
> Then i connected via VNC and see BIOS startpage and the following lines:
> Initializing Intel(r) boot agent ge v1.3.21
> pxe 2.1 build 086 (WfM 2.0)
> Press f12 for moot menu
>
> You can see a screenshot at http://nxt7.de/download/qemu.png
>
> The guests keep on this point and nothing changes. (I have wait hours.)
>
> I tried to press F12 in ThightVNC but no action.
> I must say that ThightVNC has problems with special chars (in my case).
>
> At this point, i need your help.
>
>
> Here are some details of my system
>
> Kernel: 2.6.29 form kernel.org (self compiled)
> kvm userspace: kvm-84 (self compiled)
> OS: Ubuntu 8.04.2 server
>
> r...@ls:~# lspci
> 00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
> 00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
> 00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
> 00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
> 00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
> 00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
> 00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
> 00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
> 00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
> 00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
> 00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
> 00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
> 00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
> 00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
> 00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
> 00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
> 00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
> 00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> HyperTransport Technology Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> DRAM Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> Miscellaneous Control
> 01:09.0 Ethernet controller: Lite-On Communications Inc LNE100TX [Linksys
> EtherFast 10/100] (rev 25)
> 01:0a.0 VGA compatible controller: XGI Technology Inc. (eXtreme Graphics
> Innovation) Volari Z7
> 06:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363
> AHCI Controller (rev 03)
> 06:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI
> Controller (rev 03)
> 07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
> Connection
>
>
> r...@ls:~# lspci -tvvv
> -[:00]-+-00.0  nVidia Corporation MCP55 Memory Controller
>+-01.0  nVidia Corporation MCP55 LPC Bridge
>+-01.1  nVidia Corporation MCP55 SMBus
>+-02.0  nVidia Corporation MCP55 USB Controller
>+-02.1  nVidia Corporation MCP55 USB Controller
>+-04.0  nVidia Corporation MCP55 IDE
>+-05.0  nVidia Corporation MCP55 SATA Controller
>+-05.1  nVidia Corporation MCP55 SATA Controller
>+-05.2  nVidia Corporation MCP55 SATA Controller
>+-06.0-[:01]--+-09.0  Lite-On Communications Inc LNE100TX
> [Linksys EtherFast 10/100]
>
>| \-0a.0  XGI Technology Inc. (eXtreme Graphics
>
> Innovation) Volari Z7
>+-08.0  nVidia Corporation MCP55 Ethernet
>+-09.0  nVidia Corporation MCP55 Ethernet
>+-0a.0-[:02]--
>+-0b.0-[:03]--
>+-0c.0-[:04]--
>+-0d.0-[:05]--
>+-0e.0-[:06]--+-00.0  JMicron Technologies, Inc. JMicron
> 20360/20363 AHCI Controller
>
>| \-00.1  JMicron Technologies, Inc. JMicron
>
> 20360/20363 AHCI Controller
>+-0f.0-[:07]00.0  Intel Corporation 82574L Gigabit
> Network Connection
>+-18.0  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> HyperTransport Technology Configuration
>+-18.1  Advanced Micro Devices [AMD] K8 [Athlo

Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:49:22PM +0300, Avi Kivity wrote:
>> I still think you want one MSI per device rather than one MSI per vbus,
>> to avoid scaling problems on large guest.  After Herbert's let loose on
>> the code, one MSI per queue.
>
> Yes, one MSI per TX queue, and one per RX queue :)

We're currently limited to 1024, so go wild :)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:57:38PM +0300, Avi Kivity wrote:
>> What if the guest sends N packets, then does some expensive computation
>> (say the guest scheduler switches from the benchmark process to
>> evolution).  So now we have the marker set at packet N, but the host
>> will not see it until the guest timeslice is up?
>
> Well that's fine.  The guest will use up the remainder of its
> timeslice.  After all we only have one core/hyperthread here so
> this is no different than if the packets were held up higher up
> in the guest kernel and the guest decided to do some computation.

3ms latency for ping?

(ping will always be scheduled immediately when the reply arrives if I
understand cfs, so guest load won't delay it)


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 06:49:22PM +0300, Avi Kivity wrote:
>
> I still think you want one MSI per device rather than one MSI per vbus,  
> to avoid scaling problems on large guest.  After Herbert's let loose on  
> the code, one MSI per queue.

Yes, one MSI per TX queue, and one per RX queue :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 06:57:38PM +0300, Avi Kivity wrote:
>
> What if the guest sends N packets, then does some expensive computation  
> (say the guest scheduler switches from the benchmark process to  
> evolution).  So now we have the marker set at packet N, but the host  
> will not see it until the guest timeslice is up?

Well that's fine.  The guest will use up the remainder of its
timeslice.  After all we only have one core/hyperthread here so
this is no different than if the packets were held up higher up
in the guest kernel and the guest decided to do some computation.

Once its timeslice completes the backend can start plugging away
at the backlog.

Of course it would be better to put the backend on another core
that shares the cache or a hyperthread on the same core.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Anthony Liguori wrote:
>> I don't think we even need that to end this debate.  I'm convinced we
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping
>> latency of around 300ns whereas it's only 50ns on the host.  This
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes
> were the big winner... I hate qemu sometimes.

What, this:


diff --git a/qemu/exec.c b/qemu/exec.c
index 67f3fa3..1331022 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
 unsigned long pd;
 PhysPageDesc *p;
 
+#if 1
+return ldl_p(phys_ram_base + addr);
+#endif
+
 p = phys_page_find(addr >> TARGET_PAGE_BITS);
 if (!p) {
 pd = IO_MEM_UNASSIGNED;
@@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
 unsigned long pd;
 PhysPageDesc *p;
 
+#if 1
+return ldq_p(phys_ram_base + addr);
+#endif
+
 p = phys_page_find(addr >> TARGET_PAGE_BITS);
 if (!p) {
 pd = IO_MEM_UNASSIGNED;


The way I read it, it will only run slowly once per page, then
settle to a cache miss per page.


Regardless, it makes a memslot model even more attractive.


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Anthony Liguori

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Avi Kivity wrote:
>>>
>>> The alternative is to get a notification from the stack that the
>>> packet is done processing.  Either an skb destructor in the kernel,
>>> or my new API that everyone is not rushing out to implement.
>>
>> btw, my new api is
>>
>>   io_submit(..., nr, ...): submit nr packets
>>   io_getevents(): complete nr packets
>
> I don't think we even need that to end this debate.  I'm convinced we
> have a bug somewhere.  Even disabling TX mitigation, I see a ping
> latency of around 300ns whereas it's only 50ns on the host.  This
> defies logic so I'm now looking to isolate why that is.

I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes were
the big winner... I hate qemu sometimes.

I'm pretty confident I can get at least to Greg's numbers with some
poking.  I think I understand why he's doing better after reading his
patches carefully but I also don't think it'll scale with many guests
well...  stay tuned.

But most importantly, we are darn near where vbus is with this patch wrt
added packet latency and this is totally from userspace with no host
kernel changes.

So no, userspace is not the issue.

> Regards,
>
> Anthony Liguori

Regards,

Anthony Liguori



diff --git a/qemu/exec.c b/qemu/exec.c
index 67f3fa3..1331022 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
 unsigned long pd;
 PhysPageDesc *p;
 
+#if 1
+return ldl_p(phys_ram_base + addr);
+#endif
+
 p = phys_page_find(addr >> TARGET_PAGE_BITS);
 if (!p) {
 pd = IO_MEM_UNASSIGNED;
@@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
 unsigned long pd;
 PhysPageDesc *p;
 
+#if 1
+return ldq_p(phys_ram_base + addr);
+#endif
+
 p = phys_page_find(addr >> TARGET_PAGE_BITS);
 if (!p) {
 pd = IO_MEM_UNASSIGNED;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 9bce3a0..ac77b80 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -36,6 +36,7 @@ typedef struct VirtIONet
 VirtQueue *ctrl_vq;
 VLANClientState *vc;
 QEMUTimer *tx_timer;
+QEMUBH *bh;
 int tx_timer_active;
 int mergeable_rx_bufs;
 int promisc;
@@ -504,6 +505,10 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 virtio_notify(&n->vdev, n->rx_vq);
 }
 
+VirtIODevice *global_vdev = NULL;
+
+extern void tap_try_to_recv(VLANClientState *vc);
+
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -545,42 +550,35 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 len += hdr_len;
 }
 
+global_vdev = &n->vdev;
 len += qemu_sendv_packet(n->vc, out_sg, out_num);
+global_vdev = NULL;
 
 virtqueue_push(vq, &elem, len);
 virtio_notify(&n->vdev, vq);
 }
+
+tap_try_to_recv(n->vc->vlan->first_client);
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
 VirtIONet *n = to_virtio_net(vdev);
 
-if (n->tx_timer_active) {
-virtio_queue_set_notification(vq, 1);
-qemu_del_timer(n->tx_timer);
-n->tx_timer_active = 0;
-virtio_net_flush_tx(n, vq);
-} else {
-qemu_mod_timer(n->tx_timer,
-   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-n->tx_timer_active = 1;
-virtio_queue_set_notification(vq, 0);
-}
+#if 0
+virtio_queue_set_notification(vq, 0);
+qemu_bh_schedule(n->bh);
+#else
+virtio_net_flush_tx(n, n->tx_vq);
+#endif
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_handle_tx_bh(void *opaque)
 {
 VirtIONet *n = opaque;
 
-n->tx_timer_active = 0;
-
-/* Just in case the driver is not ready on more */
-if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-return;
-
-virtio_queue_set_notification(n->tx_vq, 1);
 virtio_net_flush_tx(n, n->tx_vq);
+virtio_queue_set_notification(n->tx_vq, 1);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -675,8 +673,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
 n->vdev.get_features = virtio_net_get_features;
 n->vdev.set_features = virtio_net_set_features;
 n->vdev.reset = virtio_net_reset;
-n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
-n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+n->tx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_tx);
 n->ctrl_vq = virtio_add_queue(&n->vdev, 16, virtio_net_handle_ctrl);
 memcpy(n->mac, nd->macaddr, ETH_ALEN);
 n->status = VIRTIO_NET_S_LINK_UP;
@@ -684,10 +682,10 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
  virtio_net_receive, virtio_net_can_receive, n);
 n->vc->link_status_changed = virtio_net_set_link_status;
 
+n->bh = qemu_bh_new(virti

Re: [RFC PATCH 02/17] vbus: add virtual-bus definitions

2009-04-02 Thread Ben Hutchings
On Tue, 2009-03-31 at 14:42 -0400, Gregory Haskins wrote:
[...]
> +Create a device instance
> +
> +
> +Devices are instantiated by again utilizing the /config/vbus configfs area.
> +At first you may suspect that devices are created as subordinate objects of a
> +bus/container instance, but you would be mistaken.

This is kind of patronising; why don't you simply lay out how things
_do_ work?

>  Devices are actually
> +root-level objects in vbus specifically to allow greater flexibility in the
> +association of a device.  For instance, it may be desirable to have a single
> +device that spans multiple VMs (consider an ethernet switch, or a shared disk
> +for a cluster).  Therefore, device lifecycles are managed by creating/deleting
> +objects in /config/vbus/devices.
> +
> +Note: Creating a device instance is actually a two step process:  We need to
> +give the device instance a unique name, and we also need to give it a specific
> +device type.  It is hard to express both parameters using standard filesystem
> +operations like mkdir, so the design decision was made to require performing
> +the operation in two steps.

How about exposing a subdir for each device class under
/config/vbus/devices/ and allowing device creation only within those?
Two-stage construction is a pain for both users and implementors.

[...]
> +At this point, we are ready to roll.  Pid 4382 has access to a virtual-bus
> +namespace with one device, id=0.  Its type is:
> +
> +# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
> +virtual-ethernet
> +
> +"virtual-ethernet"?  Why is it not "venet-tap"?  Device-classes are allowed to
> +register their interfaces under an id that is not required to be the same as
> +their deviceclass.  This supports device polymorphism.   For instance,
> +consider that an interface "virtual-ethernet" may provide basic 802.x packet
> +exchange.  However, we could have various implementations of a device that
> +supports the 802.x interface, while having various implementations behind
> +them.
[...]

It seems to me that your "device-classes" correspond to drivers and
"interfaces" correspond to device classes in the LDM.  To avoid
confusion, I think the vbus terminology should be made consistent with
LDM.  And certainly these should not both be called simply "type" in the
configfs/sysfs interface.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: virtio_net: MAC address related breakage if there is no MAC area in config

2009-04-02 Thread Alex Williamson
On Thu, 2009-04-02 at 13:33 +0200, Christian Borntraeger wrote:
> I read this as the mac config field is optional (similar to all the optional
> fields we added in virtio_blk later).
> 
> I see two options:
> 1. Change our sample userspace to always allocate the config (like lguest and
> qemu)
> 2. Change the kernel code to not write into the config unless a specific 
> feature
> bit is set. (e.g. VIRTIO_NET_F_SETMAC)
> 
> 
> Opinions?

Hi Christian,

Sorry for the breakage.  My interpretation of the virtio-net config
space was that the mac field was always present and the host had
programmed a valid value when the F_MAC feature is available.  However,
from the history of the flag, it seems like your interpretation is
likely correct.  Setting the config value from the randomly generated
mac was largely opportunistic since there's no userspace that doesn't
provide a mac by default.  So perhaps we can drop that and gate the
set_mac_address entry point as shown below.  How does this look?
Thanks,

Alex


virtio_net: Set the mac config only when VIRTIO_NET_F_MAC

VIRTIO_NET_F_MAC indicates the presence of the mac field in config
space, not the validity of the value it contains.  Allow the mac to be
changed at runtime, but only push the change into config space with the
VIRTIO_NET_F_MAC feature present.

Signed-off-by: Alex Williamson 
--

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a6f1e19..9c82a39 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -575,8 +575,9 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
if (ret)
return ret;
 
-   vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
- dev->dev_addr, dev->addr_len);
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
+   vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
+ dev->dev_addr, dev->addr_len);
 
return 0;
 }
@@ -876,11 +877,8 @@ static int virtnet_probe(struct virtio_device *vdev)
vdev->config->get(vdev,
  offsetof(struct virtio_net_config, mac),
  dev->dev_addr, dev->addr_len);
-   } else {
+   } else
random_ether_addr(dev->dev_addr);
-   vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
- dev->dev_addr, dev->addr_len);
-   }
 
/* Set up our device-specific information */
vi = netdev_priv(dev);
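
For readers without the kernel tree at hand, the gating in the patch above can be modelled in userspace. This is only a sketch under stated assumptions: the `sketch_*` types and functions are hypothetical stand-ins for `struct virtio_device`, `struct net_device` and `vdev->config->set()`, and the only value taken from virtio itself is that VIRTIO_NET_F_MAC is feature bit 5.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN   6
#define SKETCH_F_MAC_BIT  5   /* VIRTIO_NET_F_MAC is feature bit 5 */

/* Hypothetical stand-in for the device side: feature bits plus config space. */
struct sketch_vdev {
    uint32_t features;
    uint8_t  config_mac[SKETCH_ETH_ALEN];
    int      config_writes;   /* counts writes into config space */
};

/* Hypothetical stand-in for the net_device side. */
struct sketch_netdev {
    uint8_t dev_addr[SKETCH_ETH_ALEN];
};

static int sketch_has_feature(const struct sketch_vdev *vdev, unsigned int bit)
{
    return (vdev->features >> bit) & 1u;
}

/*
 * Always update the driver's own copy of the MAC, but only push it into
 * config space when the mac field is known to exist there.
 */
static int sketch_set_mac(struct sketch_vdev *vdev, struct sketch_netdev *dev,
                          const uint8_t *mac)
{
    assert(vdev && dev && mac);
    memcpy(dev->dev_addr, mac, SKETCH_ETH_ALEN);

    if (sketch_has_feature(vdev, SKETCH_F_MAC_BIT)) {
        memcpy(vdev->config_mac, mac, SKETCH_ETH_ALEN);
        vdev->config_writes++;
    }
    return 0;
}
```

With the feature bit clear, `sketch_set_mac()` changes only `dev_addr` and never touches config space, which is the behaviour Christian's setup without a MAC area needs.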




Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Izik Eidus

Chris Wright wrote:
> * Izik Eidus (iei...@redhat.com) wrote:
>> Is this what we want?
>
> How about baby steps...
>
> admit that ioctl to control plane is better done via sysfs?
  

Yes


Re: [PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-02 Thread Avi Kivity

Avi Kivity wrote:
>> It doesn't had to do it.  The PCI transaction will automatically invalidate
>> caches - but qemu doesn't emulate this (and doesn't need to do on x86).
>
> So any DMA on ia64 will flush the instruction caches?!



Or maybe, the host kernel will do it after the transaction completes?  
In our case the lack of zero-copy means the host is invalidating the 
wrong addresses (memcpy source) and leaving the real destination intact.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Chris Wright
* Izik Eidus (iei...@redhat.com) wrote:
> Is this what we want?

How about baby steps...

admit that ioctl to control plane is better done via sysfs?


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:00:17PM +0300, Avi Kivity wrote:
>> Good point - if we rely on having excess cores in the host, large guest
>> scalability will drop.
>
> Going back to TX mitigation, I wonder if we could avoid it altogether
> by having a "wakeup" mechanism that does not involve a vmexit.  We
> have two cases:
>
> 1) UP, or rather guest runs on the same core/hyperthread as the
> backend.  This is the easy one, the guest simply sets a marker
> in shared memory and keeps going until its time is up.  Then the
> backend takes over, and uses a marker for notification too.
>
> The markers need to be interpreted by the scheduler so that it
> knows the guest/backend is runnable, respectively.
  


Let's look at this first.

What if the guest sends N packets, then does some expensive computation 
(say the guest scheduler switches from the benchmark process to 
evolution).  So now we have the marker set at packet N, but the host 
will not see it until the guest timeslice is up?


I think I totally misunderstood you.  Can you repeat in smaller words?

--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-02 Thread Avi Kivity

tging...@free.fr wrote:
>> What about smp?
>
> fc will broadcast to the coherence domain the cache invalidation.  So it is
> SMP-ready for usual machines.

Interesting.

>> I'm surprised the guest doesn't do this by itself?
>
> It doesn't had to do it.  The PCI transaction will automatically invalidate
> caches - but qemu doesn't emulate this (and doesn't need to do on x86).
  


So any DMA on ia64 will flush the instruction caches?!

--
error compiling committee.c: too many arguments to function



Re: [PATCH -tip 0/4 V3] tracing: kprobe-based event tracer

2009-04-02 Thread Masami Hiramatsu
Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 07:21:55PM -0400, Masami Hiramatsu wrote:
>> Andi Kleen wrote:
>>> On Wed, Apr 01, 2009 at 04:51:00PM -0400, Masami Hiramatsu wrote:
>>>> Andi Kleen wrote:
> Masami Hiramatsu  writes:
>> I agreed. Fortunately, Jim Keniston and I wrote an x86 instruction
>> decoder :-) which has been made originally for uprobe and kprobes
>> jump-optimizer.
>>
>> https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html
> An alternative would be to adapt the x86 interpreter in KVM.
> I thought for some time that that one should be available in 
> a more generic form in a library.
>>>> As far as I can see, KVM's instruction emulator is incomplete
>>> That's fine for you -- you only care about a subset of instructions
>>> anyways, don't you?
>> Actually, (in my case) I just need to decode non-FPU instructions,
> 
> What does it have to do with the FPU?  I don't think the KVM
> one is aimed at those either.

Nothing, at least in kernel :). However, as I said before,
uprobe developers want to use this decoder for decoding
FPU instructions. Fortunately, this decoder can cover
those instructions too.

>> because I'd like to check whether kprobe is on the instruction
>> boundary.
>>
>> However, KVM's insn decoder can't decode some elemental
>> instructions, and instruction flags are incorrect.
> 
> What flags?  EFLAGS? 

No, KVM's decoder has instruction classification flags for
each instruction, and some of those flags are not correct.

>> I had written instruction decoder based on it, but the result
>> was so awful!
> 
> What were the problems?

It couldn't decode kernel binary correctly and found many bugs...

https://www.redhat.com/archives/utrace-devel/2009-March/msg00013.html

On the other hand, this decoder has already been verified to produce
the same result as objdump's output.

https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html
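
To make the instruction-boundary use case concrete, here is a minimal sketch of the check, assuming the caller walks forward from the start of the enclosing function one decoded instruction at a time. The toy length decoder below is purely illustrative: it knows only a handful of prefix-less opcodes and is not the decoder posted to utrace-devel; all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy length decoder: handles only a few prefix-less opcodes and
 * returns 0 for anything it does not know.
 */
static size_t sketch_insn_len(const unsigned char *p)
{
    switch (p[0]) {
    case 0x55:          /* push %rbp */
    case 0x90:          /* nop */
    case 0xc3:          /* ret */
        return 1;
    case 0xb8:          /* mov $imm32, %eax */
    case 0xe8:          /* call rel32 */
        return 5;
    default:
        return 0;
    }
}

/*
 * Walk the function from its first byte. Returns 1 if offset is the
 * start of an instruction, 0 if it falls inside one, and -1 if the
 * walk hits an opcode the toy decoder cannot handle.
 */
static int sketch_on_boundary(const unsigned char *func, size_t func_len,
                              size_t offset)
{
    size_t pos = 0;

    assert(func != NULL);
    while (pos < offset && pos < func_len) {
        size_t len = sketch_insn_len(func + pos);

        if (len == 0)
            return -1;
        pos += len;
    }
    return pos == offset;
}
```

The point of walking from the function start is that immediate bytes are never decoded as opcodes, which is why a complete and correct length decoder matters more here than emulation support.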


> Did you report the problems to the KVM maintainers?

No, sorry, because I wrote a patch just referring to the KVM decoder;
I didn't use the KVM decoder code itself.
I guess KVM uses its decoder only for emulating a
limited number of instructions. In that case, it will be OK for KVM.


> I still think it would be better to have a single good
> decoder than a multitude of different ones tailored to specific
> cases. 

Sure, why not? I agree we'd better have a single decoder in the end.
However, I think the KVM decoder is too big and complex (and tailored?)
to start with...
So, IMHO, we'd better have a "transition period" to clarify
demands from user components and to discuss how we can integrate it.

>> So soon, I had to rewrite it based on Intel's manual entirely :-(
> 
> Ok then perhaps KVM could benefit from your work too?

If their purpose is covering all instructions, Yes.

>>> do nothing. I looked at it some time ago for doing instruction
>>> length checking for some application, but that application
>>> then disappeared. The main obstacle with making it a library 
>>> is that some KVM specific dependencies have crept in that would
>>> need to be abstracted again, but I don't think it would need a lot of effort,
>> Sorry, but I don't think so. Current KVM's decoder is much more
>> focusing on preparing instructions emulation. It requires
>> vcpu setup, fetching operators and so on. I think it needs to
>> diet their code (or well splitting from emulator).
> 
> the vcpu stuff can be all dummies. If you look at the original
> Xen version of it before it forked it was better isolated there.
> The other stuff that crept in in the KVM version could be also
> fixed.
> 
> 
>> Anyway, I don't stick with my decoder. If they can provide more
>> generic interfaces, I'd be happy to use it. :-)
> 
> I suspect "they" would need some help.

Sure, I agree.

KVM developers, I'll cross-post our x86 instruction decoder to
KVM-ML. If you are interested, please comment on it :)

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhira...@redhat.com



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:

>> vbus (if I understand it right) is a whole package of things:
>>
>> - a way to enumerate, discover, and manage devices
>
> Yes
>
>> That part duplicates PCI
>
> Yes, but the important thing to point out is it doesn't *replace* PCI.
> It simply an alternative.
  


Does it offer substantial benefits over PCI?  If not, it's just extra code.

Note that virtio is not tied to PCI, so "vbus is generic" doesn't count.


>> and it would be pretty hard to convince me we need to move to
>> something new
>
> But thats just it.  You don't *need* to move.  The two can coexist side
> by side peacefully.  "vbus" just ends up being another device that may
> or may not be present, and that may or may not have devices on it.  In
> fact, during all this testing I was booting my guest with "eth0" as
> virtio-net, and "eth1" as venet.  The both worked totally fine and
> harmoniously.  The guest simply discovers if vbus is supported via a
> cpuid feature bit and dynamically adds it if present.
  


I meant, move the development effort, testing, installed base, Windows 
drivers.


  

>> .  virtio-pci (a) works,
>
> And it will continue to work
  


So why add something new?

  

>> (b) works on Windows.
>
> virtio will continue to work on windows, as well.  And if one of my
> customers wants vbus support on windows and is willing to pay us to
> develop it, we can support *it* there as well.
  


I don't want to develop and support both virtio and vbus.  And I 
certainly don't want to depend on your customers.



>> - a different way of doing interrupts
>
> Yeah, but this is ok.  And I am not against doing that mod we talked
> about earlier where I replace dynirq with a pci shim to represent the
> vbus.  Question about that: does userspace support emulation of MSI
> interrupts?


Yes, this is new.  See the interrupt routing stuff I mentioned.  It's 
probably only in kvm.git, not even in 2.6.30.



> I would probably prefer it if I could keep the vbus IRQ (or
> IRQs when I support MQ) from being shared.  It seems registering the
> vbus as an MSI device would be more conducive to avoiding this.
  


I still think you want one MSI per device rather than one MSI per vbus, 
to avoid scaling problems on large guest.  After Herbert's let loose on 
the code, one MSI per queue.




>> - a different ring layout, and splitting notifications from the ring
>
> Again, virtio will continue to work.  And if we cannot find a way to
> collapse virtio and ioq together in a way that everyone agrees on, there
> is no harm in having two.  I have no problem saying I will maintain
> IOQ.  There is plenty of precedent for multiple ways to do the same thing.
  


IMO we should just steal whatever makes ioq better, and credit you in 
some file no one reads.  We get backwards compatibility, Windows 
support, continuity, etc.



>> I don't see the huge win here
>>
>> - placing the host part in the host kernel
>>
>> Nothing vbus-specific here.
>
> Well, it depends on what you want.  Do you want a implementation that is
> virtio-net, kvm, and pci specific while being hardcoded in?


No.  virtio is already not kvm or pci specific.  Definitely all the pci 
emulation parts will remain in user space.



> What
> happens when someone wants to access it but doesnt support pci?  What if
> something like lguest wants to use it too?  What if you want
> virtio-block next?  This is one extreme.
  


It works out well on the guest side, so it can work on the host side.  
We have virtio bindings for pci, s390, and of course lguest.  virtio 
itself is agnostic to all of these.  The main difference from vbus is 
that it's guest-only, but could easily be extended to the host side if 
we break down and do things in the kernel.



--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-02 Thread tgingold
Quoting Avi Kivity :

> Zhang, Yang wrote:
> > The data from dma will include instructions. In order to exeuting the right
> > instruction, we should to flush the i-cache to ensure those data can be see
> > by cpu.
> >
> >
> >
> > diff --git a/qemu/cache-utils.h b/qemu/cache-utils.h
> > index b45fde4..5e11d12 100644
> > --- a/qemu/cache-utils.h
> > +++ b/qemu/cache-utils.h
> > @@ -33,8 +33,22 @@ static inline void flush_icache_range(unsigned long start, unsigned long stop)
> >  asm volatile ("sync" : : : "memory");
> >  asm volatile ("isync" : : : "memory");
> >  }
> > +#define qemu_sync_idcache flush_icache_range
> > +#else
> >
> > +#ifdef __ia64__
> > +static inline void qemu_sync_idcache(unsigned long start, unsigned long stop)
> > +{
> > +while (start < stop) {
> > +   asm volatile ("fc %0" :: "r"(start));
> > +   start += 32;
> > +}
> > +asm volatile (";;sync.i;;srlz.i;;");
> > +}
> >

As I hit the same issue a year ago, here is my understanding:

> What about smp?

fc broadcasts the cache invalidation to the whole coherence domain, so it is
SMP-ready on usual machines.

> I'm surprised the guest doesn't do this by itself?

It doesn't have to.  The PCI transaction will automatically invalidate
caches - but qemu doesn't emulate this (and doesn't need to on x86).
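
As a portable illustration of the per-line loop in the patch, here is a rough sketch that walks a range in cache-line steps and hands each line address to a callback. On ia64 the callback would be the `fc` instruction and the loop would be followed by `sync.i` and `srlz.i`, as in the quoted patch; the `sketch_*` names, the callback type and the round-down of the start address are assumptions of this example, not part of the patch. The 32-byte stride is simply the value the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LINE_SIZE 32u   /* stride used by the quoted patch */

/* Called once per line; on ia64 this is where "fc" would be issued. */
typedef void (*sketch_line_op)(uintptr_t line_addr, void *ctx);

/*
 * Visit every line overlapping [start, stop) and return how many were
 * visited. The start address is rounded down so that a range beginning
 * in the middle of a line still reaches the line holding the last byte.
 */
static size_t sketch_sync_range(uintptr_t start, uintptr_t stop,
                                sketch_line_op op, void *ctx)
{
    size_t lines = 0;
    uintptr_t addr;

    assert(start <= stop);
    for (addr = start & ~(uintptr_t)(SKETCH_LINE_SIZE - 1);
         addr < stop;
         addr += SKETCH_LINE_SIZE) {
        if (op)
            op(addr, ctx);
        lines++;
    }
    return lines;
}
```

With an unaligned range such as 0x1001..0x1041, stepping by 32 from the raw start would stop before the line at 0x1040, so the round-down is worth keeping in mind if callers can pass unaligned buffers.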

Tristan.


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 06:00:17PM +0300, Avi Kivity wrote:
>
> Good point - if we rely on having excess cores in the host, large guest  
> scalability will drop.

Going back to TX mitigation, I wonder if we could avoid it altogether
by having a "wakeup" mechanism that does not involve a vmexit.  We
have two cases:

1) UP, or rather guest runs on the same core/hyperthread as the
backend.  This is the easy one, the guest simply sets a marker
in shared memory and keeps going until its time is up.  Then the
backend takes over, and uses a marker for notification too.

The markers need to be interpreted by the scheduler so that it
knows the guest/backend is runnable, respectively.

2) The guest and backend run on two cores/hyperthreads.  We'll
assume that they share caches as otherwise mitigation is the last
thing to worry about.  We use the same marker mechanism as above.
The only caveat is that if one core/hyperthread is idle, its
idle thread needs to monitor the marker (this would be a separate
per-core marker) to wake up the scheduler.

CCing Ingo so that he can flame me if I'm totally off the mark.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Patrick Mullaney wrote:
>> On Thu, 2009-04-02 at 16:27 +0300, Avi Kivity wrote:
>>
>>  
>>> virtio is a stable ABI.
>>>
>>>
>>>> However, theres still the possibility we can make this work in an ABI
>>>> friendly way with cap-bits, or other such features.  For instance, the
>>>> virtio-net driver could register both with pci and vbus-proxy and
>>>> instantiate a device with a slightly different ops structure for each or
>>>> something.  Alternatively we could write a host-side shim to expose vbus
>>>> devices as pci devices or something like that.
 
>>> Sounds complicated...
>>>
>>> 
>>
>> IMO, it doesn't sound anymore complicated than making virtio support the
>> concepts already provided by vbus/venet-tap driver. Isn't there already
>> precedent for alternative approaches co-existing and having the users
>> decide which is the most appropriate for their use case? Switching
>> drivers in order to improve latency for a certain class of applications
>> would seem like something latency sensitive users would be more than
>> willing to do. I'd like to point out 2 things. Greg has offered help
>> in moving virtio into the vbus infrastructure. The vbus infrastructure
>> is a large part of what is being proposed here.
>>   
>
> vbus (if I understand it right) is a whole package of things:
>
> - a way to enumerate, discover, and manage devices

Yes
>
> That part duplicates PCI

Yes, but the important thing to point out is it doesn't *replace* PCI.
It is simply an alternative.

> and it would be pretty hard to convince me we need to move to
> something new

But that's just it.  You don't *need* to move.  The two can coexist side
by side peacefully.  "vbus" just ends up being another device that may
or may not be present, and that may or may not have devices on it.  In
fact, during all this testing I was booting my guest with "eth0" as
virtio-net, and "eth1" as venet.  They both worked totally fine and
harmoniously.  The guest simply discovers if vbus is supported via a
cpuid feature bit and dynamically adds it if present.

> .  virtio-pci (a) works,
And it will continue to work

> (b) works on Windows.

virtio will continue to work on windows, as well.  And if one of my
customers wants vbus support on windows and is willing to pay us to
develop it, we can support *it* there as well.
>
>
> - a different way of doing interrupts
Yeah, but this is ok.  And I am not against doing that mod we talked
about earlier where I replace dynirq with a pci shim to represent the
vbus.  Question about that: does userspace support emulation of MSI
interrupts?  I would probably prefer it if I could keep the vbus IRQ (or
IRQs when I support MQ) from being shared.  It seems registering the
vbus as an MSI device would be more conducive to avoiding this.

>
> Again, the need to paravirtualize kills this on Windows (I think).
Not really.  It's the same thing conceptually as virtio, except I am not
riding on PCI so I would need to manage this somehow.  Its support would
not be "free", but I don't think the ability to support this new bus type
is ultimately predicated on having PCI support.  But like I said, this
is really vbus's problem.  virtio will continue to work, and customer
funding (or a dev volunteer) will dictate if windows can support vbus as
well.  Right now I am perfectly willing to accept that windows guests
have no ability to access the feature.

>
> - a different ring layout, and splitting notifications from the ring
Again, virtio will continue to work.  And if we cannot find a way to
collapse virtio and ioq together in a way that everyone agrees on, there
is no harm in having two.  I have no problem saying I will maintain
IOQ.  There is plenty of precedent for multiple ways to do the same thing.

>
>
> I don't see the huge win here
>
> - placing the host part in the host kernel
>
> Nothing vbus-specific here.

Well, it depends on what you want.  Do you want an implementation that is
virtio-net, kvm, and pci specific while being hardcoded in?  What
happens when someone wants to access it but doesn't support pci?  What if
something like lguest wants to use it too?  What if you want
virtio-block next?  This is one extreme.

The other extreme is the direction I have gone, which is dynamically
loaded/instantiated generic objects which can work with kvm or whatever
subsystem wants to write a vbus-connector for.  I realize this is more
complex.  It is also more flexible.  Everything has a cost, though I
will point out that a good portion of the cost has already been paid for
by me and my employer ;)

So yeah, it doesn't *need* vbus to do this.  This is just one way of
many things that could be done between the two extremes.  But I didn't
design this thing to be some randomly coded amorphous blob that I am now
trying to miraculously shoehorn into KVM.  I designed it from the start
as what I felt a good virtual IO facility could be when starting with a
clean slate, keeping KVM as a primary t

Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Andrea Arcangeli
On Thu, Apr 02, 2009 at 08:12:51AM -0700, Chris Wright wrote:
> less regardless of the vma).  To do it purely at the vma level would
> mean a vma unmap would cause the watch to go away.  So, question is...do

But madvise effects must go away at munmap/mmap-overwrite...


Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Chris Wright
* Izik Eidus (iei...@redhat.com) wrote:
> So if we move into madvice and we remove the get_task_mm() usage, we  
> will have to add notification to exit_mm() so ksm will know it should  
> stop using this mm strcture, and drop it from all the trees data...

Technically it's needed already.  This example is currently semi-broken:

main()
 ksm_register_memory
 execve()   <-- no notification unless fd is proactively marked cloexec
                (which it isn't)

   new proc...do stuff (its ->mm isn't registered)
   eventually exit() <-- close fd and clear up the old stale ->mm registered
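
For reference, "proactively marked cloexec" means setting FD_CLOEXEC on the descriptor so the kernel closes it during execve() and the driver's release hook runs at that point instead of at the new program's exit. A small userspace sketch using plain fcntl(2); the `sketch_*` helper names are hypothetical and the descriptor can be any open fd, the ksm one included.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Set FD_CLOEXEC while preserving other descriptor flags. 0 on success. */
static int sketch_mark_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if the descriptor will be closed across execve(), else 0. */
static int sketch_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags >= 0 && (flags & FD_CLOEXEC) != 0;
}
```

This only narrows the window from userspace; it does not replace the in-kernel notification on exec/exit discussed in this thread, since an application is free not to set the flag.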


Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Chris Wright
* Andrea Arcangeli (aarca...@redhat.com) wrote:
> On Wed, Apr 01, 2009 at 10:31:14PM -0700, Chris Wright wrote:
> >   - register only ATM, can add MADV_UNSHAREABLE to allow an app to proactively
> >     unregister, but need a cleanup when ->mm goes away via exit/exec
> 
> The unregister cleanup must happen at the vma level (with unregister
> when vma goes away or is overwritten) for this to provide sane madvise
> semantics (not just in exit/exec, but in unmap/mmap too). Otherwise
> this is all but madvise. Basically we need a chunk of code in core VM
> when KSM=y/m, even if we keep returning -EINVAL when KSM=n (for
> backwards compatibility, -ENOSYS not). Example, vma must be split in
> two if you MAP_SHARABLE only part of it etc...

Yes, of course.  I mentioned that (push whole thing into vma).
Current api is really at ->mm level, it's vma agnostic.  Simply put:
watch for pages in this ->mm between start and start+len (more or
less regardless of the vma).  To do it purely at the vma level would
mean a vma unmap would cause the watch to go away.  So, question is...do
we need something in ->mm as well (like mlockall)?


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Michael S. Tsirkin
On Thu, Apr 02, 2009 at 10:43:19PM +1030, Rusty Russell wrote:
> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
> > You do not need to know when the packet is copied (which I currently
> > do).  You only need it for zero-copy (of which I would like to support,
> > but as I understand it there are problems with the reliability of proper
> > callback (i.e. skb->destructor).
> 
> But if you have a UP guest, there will *never* be another packet in the queue
> at this point, since it wasn't running.
> 
> As Avi said, you can do the processing in another thread and go back to the
> guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
> again before the thread did for exactly this kind of reason.
> 
> While Avi's point about a "powerful enough userspace API" is probably valid,
> I don't think it's going to happen.  It's almost certainly less code to put a
> virtio_net server in the kernel, than it is to create such a powerful
> interface (see vringfd & tap).  And that interface would have one user in
> practice.
> 
> So, let's roll out a kernel virtio_net server.  Anyone?
> Rusty.

BTW, whatever approach is chosen, to enable zero-copy transmits, it seems that
we still must add tracking of when the skb has actually been transmitted, right?

Rusty, I think this is what you did in your patch from 2008 to add destructor
for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
and it seems that it would make zero-copy possible - or was there some problem
with that approach? Do you happen to remember?

-- 
MST


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 04:07:09PM +0300, Avi Kivity wrote:
>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>
> Going off on a tangent here, I don't really think it should matter
> whether we're UP or SMP.  The ideal state is where we have the
> same number of (virtual) TX queues as there are cores in the guest.
> On the host side we need the backend to run at least on a core
> that shares cache with the corresponding guest queue/core.  If
> that happens to be the same core as the guest core then it should
> work as well.
>
> IOW we should optimise it as if the host were UP.
  


Good point - if we rely on having excess cores in the host, large guest 
scalability will drop.


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 04:07:09PM +0300, Avi Kivity wrote:
>
> I think Rusty did mean a UP guest, and without schedule-and-forget.

Going off on a tangent here, I don't really think it should matter
whether we're UP or SMP.  The ideal state is where we have the
same number of (virtual) TX queues as there are cores in the guest.
On the host side we need the backend to run at least on a core
that shares cache with the corresponding guest queue/core.  If
that happens to be the same core as the guest core then it should
work as well.

IOW we should optimise it as if the host were UP.

> The problem is that we already have virtio guest drivers going several  
> kernel versions back, as well as Windows drivers.  We can't keep  
> changing the infrastructure under people's feet.

Yes I agree that changing the guest-side driver is a no-no.  However,
we should be able to achieve what's shown here without modifying the
guest-side.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Anthony Liguori

Avi Kivity wrote:
> Avi Kivity wrote:
>> The alternative is to get a notification from the stack that the
>> packet is done processing.  Either an skb destructor in the kernel,
>> or my new API that everyone is not rushing out to implement.
>
> btw, my new api is
>
>   io_submit(..., nr, ...): submit nr packets
>   io_getevents(): complete nr packets

I don't think we even need that to end this debate.  I'm convinced we 
have a bug somewhere.  Even disabling TX mitigation, I see a ping 
latency of around 300ns whereas it's only 50ns on the host.  This defies 
logic so I'm now looking to isolate why that is.


Regards,

Anthony Liguori



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Avi Kivity wrote:
> The alternative is to get a notification from the stack that the
> packet is done processing.  Either an skb destructor in the kernel, or
> my new API that everyone is not rushing out to implement.

btw, my new api is


  io_submit(..., nr, ...): submit nr packets
  io_getevents(): complete nr packets
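
As a rough illustration of the batching idea behind the two calls above, here is a minimal userspace sketch in C. It is a toy model only: `pkt_queue`, `pkt_submit()` and `pkt_getevents()` are hypothetical names invented for this example, not the real `io_submit()`/`io_getevents()` syscalls, and no packets are actually transmitted. The point is just that one call covers nr packets, so the number of crossings is per batch rather than per packet.

```c
#include <stddef.h>

#define QUEUE_DEPTH 64

/* Hypothetical toy queue: it only counts packets, nothing is sent. */
struct pkt_queue {
    size_t in_flight;  /* submitted but not yet reaped */
    size_t calls;      /* number of submit/getevents calls made */
};

/* Submit up to nr packets in one call; returns how many were accepted. */
static size_t pkt_submit(struct pkt_queue *q, size_t nr)
{
    size_t room = QUEUE_DEPTH - q->in_flight;
    size_t accepted = nr < room ? nr : room;

    q->in_flight += accepted;
    q->calls++;
    return accepted;
}

/*
 * Reap up to max completions in one call. In this model every in-flight
 * packet is treated as already completed.
 */
static size_t pkt_getevents(struct pkt_queue *q, size_t max)
{
    size_t done = q->in_flight < max ? q->in_flight : max;

    q->in_flight -= done;
    q->calls++;
    return done;
}
```

Submitting ten packets and reaping them costs two calls in this model, independent of the batch size.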

--
error compiling committee.c: too many arguments to function



Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Andrea Arcangeli
On Wed, Apr 01, 2009 at 10:31:14PM -0700, Chris Wright wrote:
>   - register only ATM, can add MADV_UNSHAREABLE to allow an app to proactively
> unregister, but need a cleanup when ->mm goes away via exit/exec

The unregister cleanup must happen at the vma level (with unregister
when vma goes away or is overwritten) for this to provide sane madvise
semantics (not just in exit/exec, but in unmap/mmap too). Otherwise
this is all but madvise. Basically we need a chunk of code in core VM
when KSM=y/m, even if we keep returning -EINVAL when KSM=n (for
backwards compatibility, -ENOSYS not). Example, vma must be split in
two if you MAP_SHARABLE only part of it etc...
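
To make the splitting point concrete, a small userspace sketch in C follows. It only models address ranges: `struct range` and `split_for_advice()` are hypothetical helpers written for illustration, not kernel code, and the real VM would go through its own vma splitting paths.

```c
#include <stddef.h>

/* Hypothetical model of a mapping: [start, end) plus a "shareable" flag. */
struct range {
    unsigned long start;
    unsigned long end;
    int shareable;
};

/*
 * Mark [start, end) inside vma as shareable, splitting the original range
 * so the flag covers exactly the advised part. Writes up to 3 pieces to
 * out and returns how many, or 0 if the request is not contained in vma.
 */
static size_t split_for_advice(struct range vma, unsigned long start,
                               unsigned long end, struct range out[3])
{
    size_t n = 0;

    if (start >= end || start < vma.start || end > vma.end)
        return 0;

    if (start > vma.start)  /* untouched head keeps the old flag */
        out[n++] = (struct range){ vma.start, start, vma.shareable };

    out[n++] = (struct range){ start, end, 1 };  /* advised part */

    if (end < vma.end)      /* untouched tail keeps the old flag */
        out[n++] = (struct range){ end, vma.end, vma.shareable };

    return n;
}
```

Advising the first half of a mapping yields two pieces, and advising a window in the middle yields three.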


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Patrick Mullaney
On Thu, 2009-04-02 at 16:27 +0300, Avi Kivity wrote:

> 
> virtio is a stable ABI.
> 
> > However, theres still the possibility we can make this work in an ABI
> > friendly way with cap-bits, or other such features.  For instance, the
> > virtio-net driver could register both with pci and vbus-proxy and
> > instantiate a device with a slightly different ops structure for each or
> > something.  Alternatively we could write a host-side shim to expose vbus
> > devices as pci devices or something like that.
> >   
> 
> Sounds complicated...
> 

IMO, it doesn't sound anymore complicated than making virtio support the
concepts already provided by vbus/venet-tap driver. Isn't there already
precedent for alternative approaches co-existing and having the users
decide which is the most appropriate for their use case? Switching
drivers in order to improve latency for a certain class of applications
would seem like something latency sensitive users would be more than
willing to do. I'd like to point out 2 things. Greg has offered help
in moving virtio into the vbus infrastructure. The vbus infrastructure
is a large part of what is being proposed here.




Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
>> If you have a request-response workload with the wire idle and latency
>> critical, then there's no problem having an exit per packet because
>> (a) there aren't that many packets and (b) the guest isn't doing any
>> batching, so guest overhead will swamp the hypervisor overhead.
>
> Right, so the trick is to use an algorithm that adapts here.  Batching
> solves the first case, but not the second.  The bidir napi thing solves
> both, but it does assume you have ample host processing power to run the
> algorithm concurrently.  This may or may not be suitable to all
> applications, I admit.

The alternative is to get a notification from the stack that the packet
is done processing.  Either an skb destructor in the kernel, or my new
API that everyone is not rushing out to implement.

>>> Right now its way way way worse than 2us.  In fact, at my last reading
>>> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
>>> maintaining line-rate) and I will be impressed.  Heck, shorten it to
>>> 80us and I will be impressed.
>>
>> The 3060us thing is a timer, not cpu time.
>
> Agreed, but its still "state of the art" from an observer perspective.
> The reason "why", though easily explainable, is inconsequential to most
> people.  FWIW, I have seen virtio-net do a much more respectable 350us
> on an older version, so I know there is plenty of room for improvement.

All I want is the notification, and the timer is headed into the nearest
landfill.



--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Patrick Mullaney wrote:
> On Thu, 2009-04-02 at 16:27 +0300, Avi Kivity wrote:
>> virtio is a stable ABI.
>>
>>> However, theres still the possibility we can make this work in an ABI
>>> friendly way with cap-bits, or other such features.  For instance, the
>>> virtio-net driver could register both with pci and vbus-proxy and
>>> instantiate a device with a slightly different ops structure for each or
>>> something.  Alternatively we could write a host-side shim to expose vbus
>>> devices as pci devices or something like that.
>>
>> Sounds complicated...
>
> IMO, it doesn't sound anymore complicated than making virtio support the
> concepts already provided by vbus/venet-tap driver. Isn't there already
> precedent for alternative approaches co-existing and having the users
> decide which is the most appropriate for their use case? Switching
> drivers in order to improve latency for a certain class of applications
> would seem like something latency sensitive users would be more than
> willing to do. I'd like to point out 2 things. Greg has offered help
> in moving virtio into the vbus infrastructure. The vbus infrastructure
> is a large part of what is being proposed here.

vbus (if I understand it right) is a whole package of things:

- a way to enumerate, discover, and manage devices

That part duplicates PCI and it would be pretty hard to convince me we 
need to move to something new.  virtio-pci (a) works, (b) works on Windows.


- a different way of doing interrupts

Again, the need to paravirtualize kills this on Windows (I think).

- a different ring layout, and splitting notifications from the ring

I don't see the huge win here

- placing the host part in the host kernel

Nothing vbus-specific here.

Switching drivers is unfortunately not easy on Linux as you need a new 
kernel; it's easier on Windows once you have the drivers written.


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>
>>>> Avi Kivity wrote:
>>>>
>>>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>>>
>>>> Understood, but yet you need to do this if you want something like
>>>> iSCSI READ transactions to have as low-latency as possible.
>>>>
>>> Dunno, two microseconds is too much?  The wire imposes much more.
>>>
>>> 
>>
>> No, but thats not what we are talking about.  You said signaling on
>> every packet is prohibitively expensive.  I am saying signaling on every
>> packet is required for decent latency.  So is it prohibitively expensive
>> or not?
>>   
>
> We're heading dangerously into the word-game area.  Let's not do that.
>
> If you have a high throughput workload with many packets per seconds
> then an exit per packet (whether to userspace or to the kernel) is
> expensive.  So you do exit mitigation.  Latency is not important since
> the packets are going to sit in the output queue anyway.

Agreed.  virtio-net currently does this with batching.  I do with the
bidir napi thing (which effectively crosses the producer::consumer > 1
threshold to mitigate the signal path).


>
> If you have a request-response workload with the wire idle and latency
> critical, then there's no problem having an exit per packet because
> (a) there aren't that many packets and (b) the guest isn't doing any
> batching, so guest overhead will swamp the hypervisor overhead.
Right, so the trick is to use an algorithm that adapts here.  Batching
solves the first case, but not the second.  The bidir napi thing solves
both, but it does assume you have ample host processing power to run the
algorithm concurrently.  This may or may not be suitable to all
applications, I admit.

>
> If you have a low latency request-response workload mixed with a high
> throughput workload, then you aren't going to get low latency since
> your low latency packets will sit on the queue behind the high
> throughput packets.  You can fix that with multiqueue and then you're
> back to one of the scenarios above.
Agreed, and thats ok.  Now we are getting more into 802.1p type MQ
issues anyway, if the application cared about it that much.

>
>> I think most would agree that adding 2us is not bad, but so far that is
>> an unproven theory that the IO path in question only adds 2us.   And we
>> are not just looking at the rate at which we can enter and exit the
>> guest...we need the whole path...from the PIO kick to the dev_xmit() on
>> the egress hardware, to the ingress and rx-injection.  This includes any
>> and all penalties associated with the path, even if they are imposed by
>> something like the design of tun-tap.
>>   
>
> Correct, we need to look at the whole path.  That's why the wishing
> well is clogged with my 'give me a better userspace interface' emails.
>
>> Right now its way way way worse than 2us.  In fact, at my last reading
>> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
>> maintaining line-rate) and I will be impressed.  Heck, shorten it to
>> 80us and I will be impressed.
>>   
>
> The 3060us thing is a timer, not cpu time.
Agreed, but its still "state of the art" from an observer perspective. 
The reason "why", though easily explainable, is inconsequential to most
people.  FWIW, I have seen virtio-net do a much more respectable 350us
on an older version, so I know there is plenty of room for improvement.

>   We aren't starting a JVM for each packet.
Heh...it kind of feels like that right now, so hopefully some
improvement will at least be on the one thing that comes out of all this.

-Greg





Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>
>>>> Rusty Russell wrote:
>>>>
>>>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>>>
>>>>>> You do not need to know when the packet is copied (which I currently
>>>>>> do).  You only need it for zero-copy (of which I would like to
>>>>>> support, but as I understand it there are problems with the
>>>>>> reliability of proper callback (i.e. skb->destructor).
>>>>>>
>>>>> But if you have a UP guest,
>>>>>
>>>> I assume you mean UP host ;)
>>>>
>>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>> 
>> That doesnt make sense to me, tho.  All the testing I did was a UP
>> guest, actually.  Why would I be constrained to run without the
>> scheduling unless the host was also UP?
>>   
>
> You aren't constrained.  And your numbers show it works.
>
>>>
>>> The problem is that we already have virtio guest drivers going several
>>> kernel versions back, as well as Windows drivers.  We can't keep
>>> changing the infrastructure under people's feet.
>>> 
>>
>> Well, IIUC the virtio code itself declares the ABI as unstable, so there
>> technically *is* an out if we really wanted one.  But I certainly
>> understand the desire to not change this ABI if at all possible, and
>> thus the resistance here.
>>   
>
> virtio is a stable ABI.

Dang!  Scratch that.
>
>> However, theres still the possibility we can make this work in an ABI
>> friendly way with cap-bits, or other such features.  For instance, the
>> virtio-net driver could register both with pci and vbus-proxy and
>> instantiate a device with a slightly different ops structure for each or
>> something.  Alternatively we could write a host-side shim to expose vbus
>> devices as pci devices or something like that.
>>   
>
> Sounds complicated...

Well, the first solution would be relatively trivial...at least on the
guest side.  All the other infrastructure is done and included in the
series I sent out.  The changes to the virtio-net driver on the guest
itself would be minimal.  The bigger effort would be converting
venet-tap to use virtio-ring instead of IOQ.  But this would arguably be
less work than starting a virtio-net backend module from scratch because
you would have to not only code up the entire virtio-net backend, but
also all the pci emulation and irq routing stuff that is required (and
is already done by the vbus infrastructure).  Here all the major pieces
are in place, just the xmit and rx routines need to be converted to
virtio-isms.

For the second option, I agree.  Its probably too nasty and it would be
better if there was just either a virtio-net to kvm-host hack, or a more
pci oriented version of a vbus-like framework.

That said, there is certainly nothing wrong with having an alternate
option.  There is plenty of precedent for having different drivers for
different subsystems, etc, even if there is overlap.  Heck, even KVM has
realtek, e1000, and virtio-net, etc.  Would our kvm community be willing
to work with me to get these patches merged?  I am perfectly willing to
maintain them.  That said, the general infrastructure should probably
not live in -kvm (perhaps -tip, -mm, or -next, etc is more
appropriate).  So a good plan might be to shoot for the core going into
a more general upstream tree.  When/if that happens, then the kvm
community could consider the kvm specific parts, etc.  I realize this is
all pending review acceptance by everyone involved...

-Greg







Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
> Avi Kivity wrote:
>> Gregory Haskins wrote:
>>> Avi Kivity wrote:
>>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>
>>> Understood, but yet you need to do this if you want something like iSCSI
>>> READ transactions to have as low-latency as possible.
>>
>> Dunno, two microseconds is too much?  The wire imposes much more.
>
> No, but thats not what we are talking about.  You said signaling on
> every packet is prohibitively expensive.  I am saying signaling on every
> packet is required for decent latency.  So is it prohibitively expensive
> or not?
  


We're heading dangerously into the word-game area.  Let's not do that.

If you have a high throughput workload with many packets per seconds 
then an exit per packet (whether to userspace or to the kernel) is 
expensive.  So you do exit mitigation.  Latency is not important since 
the packets are going to sit in the output queue anyway.


If you have a request-response workload with the wire idle and latency 
critical, then there's no problem having an exit per packet because (a) 
there aren't that many packets and (b) the guest isn't doing any 
batching, so guest overhead will swamp the hypervisor overhead.


If you have a low latency request-response workload mixed with a high 
throughput workload, then you aren't going to get low latency since your 
low latency packets will sit on the queue behind the high throughput 
packets.  You can fix that with multiqueue and then you're back to one 
of the scenarios above.



> I think most would agree that adding 2us is not bad, but so far that is
> an unproven theory that the IO path in question only adds 2us.   And we
> are not just looking at the rate at which we can enter and exit the
> guest...we need the whole path...from the PIO kick to the dev_xmit() on
> the egress hardware, to the ingress and rx-injection.  This includes any
> and all penalties associated with the path, even if they are imposed by
> something like the design of tun-tap.
  


Correct, we need to look at the whole path.  That's why the wishing well 
is clogged with my 'give me a better userspace interface' emails.



> Right now its way way way worse than 2us.  In fact, at my last reading
> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
> maintaining line-rate) and I will be impressed.  Heck, shorten it to
> 80us and I will be impressed.
  


The 3060us thing is a timer, not cpu time.  We aren't starting a JVM for 
each packet.  We could remove it given a notification API, or 
duplicating the sched-and-forget thing, like Rusty did with lguest or 
Mark with qemu.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Izik Eidus

Chris Wright wrote:
> * Anthony Liguori (anth...@codemonkey.ws) wrote:
>> Using an interface like madvise() would force the issue to be dealt with
>> properly from the start :-)
>
> Yeah, I'm not at all opposed to it.
>
> This updates to madvise for register and sysfs for control.
>
> madvise issues:
> - MADV_SHAREABLE
>   - register only ATM, can add MADV_UNSHAREABLE to allow an app to proactively
>     unregister, but need a cleanup when ->mm goes away via exit/exec
>   - will register a region per vma, should probably push the whole thing
>     into vma rather than keep [mm,addr,len] tuple in ksm

  

The main problem that ksm will face when removing the fd interface is:
right now when you register memory into ksm, you open an fd, and then ksm
does get_task_mm(); we do mmput() when the file is closed
(note that this doesn't mean that if you fork and don't close the fd the
memory won't go away: get_task_mm() doesn't protect the vmas inside
the mm structure, and therefore they can still be removed)


So if we move to madvise and remove the get_task_mm() usage, we
will have to add a notification to exit_mm() so ksm will know it should
stop using this mm structure, and drop it from all the tree data...


Is this what we want?


Re: Qemu process in Guest

2009-04-02 Thread Avi Kivity

Kumar, Venkat wrote:
> 1. How does Qemu process start running in Guest?

qemu doesn't run in the guest.  Unless you log into the guest and start
qemu.

But I don't think that's what you were asking?

> 2. How does a guest's I/O request get trapped into the user mode qemu process?


kvm traps the I/O and returns back to userspace.

Look in libkvm/libkvm.c's handle_io() and handle_mmio().  These 
eventually call into qemu/qemu-kvm.c kvm_inb() and friends, and 
kvm_mmio_read() and kvm_mmio_write().
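
A minimal sketch of that dispatch step, in plain C, may help picture it. Everything here is a simplified stand-in written for illustration: `struct vcpu_exit`, the `EXIT_*` values and the `toy_*` handlers are hypothetical and do not match the real kvm structures or the libkvm functions named above; the one-register "devices" are made up too.

```c
#include <stdint.h>

/* Hypothetical, simplified exit record; not the real kvm layout. */
enum exit_reason { EXIT_IO, EXIT_MMIO, EXIT_OTHER };

struct vcpu_exit {
    enum exit_reason reason;
    int is_write;
    uint64_t addr;   /* port number or guest-physical address */
    uint32_t data;   /* value written, or filled in on a read */
};

/* Toy device model: one register behind a port, one behind MMIO. */
#define TOY_PORT 0x3f8u
#define TOY_MMIO 0xfee00000u
static uint32_t port_reg, mmio_reg;

static int toy_handle_io(struct vcpu_exit *e)
{
    if (e->addr != TOY_PORT)
        return -1;               /* nobody claims this port */
    if (e->is_write)
        port_reg = e->data;
    else
        e->data = port_reg;
    return 0;
}

static int toy_handle_mmio(struct vcpu_exit *e)
{
    if (e->addr != TOY_MMIO)
        return -1;
    if (e->is_write)
        mmio_reg = e->data;
    else
        e->data = mmio_reg;
    return 0;
}

/* Userspace side: look at why the guest stopped and route the access. */
static int handle_exit(struct vcpu_exit *e)
{
    switch (e->reason) {
    case EXIT_IO:
        return toy_handle_io(e);
    case EXIT_MMIO:
        return toy_handle_mmio(e);
    default:
        return -1;
    }
}
```

In this model the guest's write to the port shows up in userspace as an exit record, and a later read of the same port gets the stored value back before the guest resumes.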


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>
>>>
>>> 
>>
>> Understood, but yet you need to do this if you want something like iSCSI
>> READ transactions to have as low-latency as possible.
>>   
>
> Dunno, two microseconds is too much?  The wire imposes much more.
>

No, but thats not what we are talking about.  You said signaling on
every packet is prohibitively expensive.  I am saying signaling on every
packet is required for decent latency.  So is it prohibitively expensive
or not?

I think most would agree that adding 2us is not bad, but so far that is
an unproven theory that the IO path in question only adds 2us.   And we
are not just looking at the rate at which we can enter and exit the
guest...we need the whole path...from the PIO kick to the dev_xmit() on
the egress hardware, to the ingress and rx-injection.  This includes any
and all penalties associated with the path, even if they are imposed by
something like the design of tun-tap.

Right now its way way way worse than 2us.  In fact, at my last reading
this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
maintaining line-rate) and I will be impressed.  Heck, shorten it to
80us and I will be impressed.

-Greg





Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
> Avi Kivity wrote:
>> Gregory Haskins wrote:
>>> Rusty Russell wrote:
>>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>>> You do not need to know when the packet is copied (which I currently
>>>>> do).  You only need it for zero-copy (of which I would like to
>>>>> support, but as I understand it there are problems with the
>>>>> reliability of proper callback (i.e. skb->destructor).
>>>>
>>>> But if you have a UP guest,
>>>
>>> I assume you mean UP host ;)
>>
>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>
> That doesnt make sense to me, tho.  All the testing I did was a UP
> guest, actually.  Why would I be constrained to run without the
> scheduling unless the host was also UP?
  


You aren't constrained.  And your numbers show it works.



>> The problem is that we already have virtio guest drivers going several
>> kernel versions back, as well as Windows drivers.  We can't keep
>> changing the infrastructure under people's feet.
>
> Well, IIUC the virtio code itself declares the ABI as unstable, so there
> technically *is* an out if we really wanted one.  But I certainly
> understand the desire to not change this ABI if at all possible, and
> thus the resistance here.
  


virtio is a stable ABI.


> However, theres still the possibility we can make this work in an ABI
> friendly way with cap-bits, or other such features.  For instance, the
> virtio-net driver could register both with pci and vbus-proxy and
> instantiate a device with a slightly different ops structure for each or
> something.  Alternatively we could write a host-side shim to expose vbus
> devices as pci devices or something like that.
  


Sounds complicated...

--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Rusty Russell wrote:
>>  
>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>  
>>>> You do not need to know when the packet is copied (which I currently
>>>> do).  You only need it for zero-copy (of which I would like to
>>>> support, but as I understand it there are problems with the
>>>> reliability of proper callback (i.e. skb->destructor).
>>>>
>>> But if you have a UP guest,
>>> 
>>
>> I assume you mean UP host ;)
>>
>>   
>
> I think Rusty did mean a UP guest, and without schedule-and-forget.
That doesnt make sense to me, tho.  All the testing I did was a UP
guest, actually.  Why would I be constrained to run without the
scheduling unless the host was also UP?

>
>> Hmm..well I was hoping to be able to work with you guys to make my
>> proposal fit this role.  If there is no interest in that, I hope that my
>> infrastructure itself may still be considered for merging (in *some*
>> tree, not -kvm per se) as I would prefer to not maintain it out of tree
>> if it can be avoided.
>
> The problem is that we already have virtio guest drivers going several
> kernel versions back, as well as Windows drivers.  We can't keep
> changing the infrastructure under people's feet.

Well, IIUC the virtio code itself declares the ABI as unstable, so there
technically *is* an out if we really wanted one.  But I certainly
understand the desire to not change this ABI if at all possible, and
thus the resistance here.

However, theres still the possibility we can make this work in an ABI
friendly way with cap-bits, or other such features.  For instance, the
virtio-net driver could register both with pci and vbus-proxy and
instantiate a device with a slightly different ops structure for each or
something.  Alternatively we could write a host-side shim to expose vbus
devices as pci devices or something like that.

-Greg

>
>






[PATCH] add tests for short/near Jcc and call instruction emulation

2009-04-02 Thread Gleb Natapov
Signed-off-by: Gleb Natapov 
diff --git a/user/test/x86/realmode.c b/user/test/x86/realmode.c
index f6d5326..336ba1c 100644
--- a/user/test/x86/realmode.c
+++ b/user/test/x86/realmode.c
@@ -361,14 +361,94 @@ void test_io(void)
 void test_call(void)
 {
struct regs inregs = { 0 }, outregs;
+   u32 esp[16];
+
+   inregs.esp = (u32)esp;
+
MK_INSN(call1, "mov $test_function, %eax \n\t"
   "call *%eax\n\t");
+   MK_INSN(call_near1, "jmp 2f\n\t"
+   "1: mov $0x1234, %eax\n\t"
+   "ret\n\t"
+   "2: call 1b\t");
+   MK_INSN(call_near2, "call 1f\n\t"
+   "jmp 2f\n\t"
+   "1: mov $0x1234, %eax\n\t"
+   "ret\n\t"
+   "2:\t");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_call1,
  insn_call1_end - insn_call1);
if(!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x1234)
print_serial("Call Test 1: FAIL\n");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_call_near1, insn_call_near1_end - insn_call_near1);
+   if(!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x1234)
+   print_serial("Call near Test 1: FAIL\n");
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_call_near2, insn_call_near2_end - insn_call_near2);
+   if(!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x1234)
+   print_serial("Call near Test 2: FAIL\n");
+}
+
+void test_jcc_short(void)
+{
+   struct regs inregs = { 0 }, outregs;
+   MK_INSN(jnz_short1, "jnz 1f\n\t"
+   "mov $0x1234, %eax\n\t"
+   "1:\n\t");
+   MK_INSN(jnz_short2, "1:\n\t"
+   "cmp $0x1234, %eax\n\t"
+   "mov $0x1234, %eax\n\t"
+   "jnz 1b\n\t");
+   MK_INSN(jmp_short1, "jmp 1f\n\t"
+ "mov $0x1234, %eax\n\t"
+ "1:\n\t");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_jnz_short1, insn_jnz_short1_end - insn_jnz_short1);
+   if(!regs_equal(&inregs, &outregs, 0))
+   print_serial("JNZ short Test 1: FAIL\n");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_jnz_short2, insn_jnz_short2_end - insn_jnz_short2);
+   if(!regs_equal(&inregs, &outregs, R_AX) || !(outregs.eflags & (1 << 6)))
+   print_serial("JNZ short Test 2: FAIL\n");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_jmp_short1, insn_jmp_short1_end - insn_jmp_short1);
+   if(!regs_equal(&inregs, &outregs, 0))
+   print_serial("JMP short Test 1: FAIL\n");
+}
+
+void test_jcc_near(void)
+{
+   struct regs inregs = { 0 }, outregs;
+   /* encode near jmp manually. gas will not do it if offsets < 127 byte */
+   MK_INSN(jnz_near1, ".byte 0x0f, 0x85, 0x06, 0x00\n\t"
+  "mov $0x1234, %eax\n\t");
+   MK_INSN(jnz_near2, "cmp $0x1234, %eax\n\t"
+  "mov $0x1234, %eax\n\t"
+  ".byte 0x0f, 0x85, 0xf0, 0xff\n\t");
+   MK_INSN(jmp_near1, ".byte 0xE9, 0x06, 0x00\n\t"
+  "mov $0x1234, %eax\n\t");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_jnz_near1, insn_jnz_near1_end - insn_jnz_near1);
+   if(!regs_equal(&inregs, &outregs, 0))
+   print_serial("JNZ near Test 1: FAIL\n");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_jnz_near2, insn_jnz_near2_end - insn_jnz_near2);
+   if(!regs_equal(&inregs, &outregs, R_AX) || !(outregs.eflags & (1 << 6)))
+   print_serial("JNZ near Test 2: FAIL\n");
+
+   exec_in_big_real_mode(&inregs, &outregs,
+   insn_jmp_near1, insn_jmp_near1_end - insn_jmp_near1);
+   if(!regs_equal(&inregs, &outregs, 0))
+   print_serial("JMP near Test 1: FAIL\n");
 }
 
 void test_null(void)
@@ -389,6 +469,10 @@ void start(void)
test_add_imm();
test_io();
test_eflags_insn();
+   test_jcc_short();
+   test_jcc_near();
+   /* test_call() uses short jump so call it after testing jcc */
+   test_call();
 
exit(0);
 }
--
Gleb.
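
For readers checking the hand-encoded bytes in test_jcc_near() above, here is a small standalone C sketch of the 16-bit near JNZ encoding (opcode 0F 85 followed by a little-endian rel16). The helper names `encode_jnz_near16()` and `decode_rel16()` are made up for this example and are not part of the test suite. The displacements are relative to the end of the jump: assuming the assembler emits 6-byte encodings for the 32-bit `mov` and `cmp` in 16-bit code, +6 skips the following `mov`, and -16 (6 + 6 + 4 bytes) goes back to the start of the `cmp`.

```c
#include <stdint.h>

/* Encode "jnz rel16": 0F 85 followed by the displacement, low byte first. */
static void encode_jnz_near16(int16_t rel, uint8_t out[4])
{
    uint16_t u = (uint16_t)rel;

    out[0] = 0x0f;
    out[1] = 0x85;
    out[2] = (uint8_t)(u & 0xff);
    out[3] = (uint8_t)(u >> 8);
}

/* Read a little-endian rel16 back as a signed displacement. */
static int16_t decode_rel16(const uint8_t *p)
{
    return (int16_t)(uint16_t)(p[0] | (p[1] << 8));
}
```

Encoding +6 gives `0f 85 06 00` and encoding -16 gives `0f 85 f0 ff`, matching the `.byte` sequences in the patch.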


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
> Avi Kivity wrote:
>> My 'prohibitively expensive' is true only if you exit every packet.
>
> Understood, but yet you need to do this if you want something like iSCSI
> READ transactions to have as low-latency as possible.


Dunno, two microseconds is too much?  The wire imposes much more.

--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
> Rusty Russell wrote:
>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>> You do not need to know when the packet is copied (which I currently
>>> do).  You only need it for zero-copy (of which I would like to support,
>>> but as I understand it there are problems with the reliability of proper
>>> callback (i.e. skb->destructor).
>>
>> But if you have a UP guest,
>
> I assume you mean UP host ;)

I think Rusty did mean a UP guest, and without schedule-and-forget.

> Hmm..well I was hoping to be able to work with you guys to make my
> proposal fit this role.  If there is no interest in that, I hope that my
> infrastructure itself may still be considered for merging (in *some*
> tree, not -kvm per se) as I would prefer to not maintain it out of tree
> if it can be avoided.


The problem is that we already have virtio guest drivers going several 
kernel versions back, as well as Windows drivers.  We can't keep 
changing the infrastructure under people's feet.



--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> It's more of a "schedule and forget" which I think brings you the
>>> win.  The host disables notifications and schedules the actual tx work
>>> (rx from the host's perspective).  So now the guest and host continue
>>> producing and consuming packets in parallel.  So long as the guest is
>>> faster (due to the host being throttled?), notifications continue to
>>> be disabled.
>>> 
>> Yep, when the "producer::consumer" ratio is > 1, we mitigate
>> signaling. When its < 1, we signal roughly once per packet.
>>
>>  
>>> If you changed your rx_isr() to process the packets immediately
>>> instead of scheduling, I think throughput would drop dramatically.
>>> 
>> Right, that is the point. :) This is that "soft asic" thing I was
>> talking about yesterday.
>>   
>
> But all that has nothing to do with where the code lives, in the
> kernel or userspace.

Agreed, but note I've already stated that some of my boost is likely from
being in-kernel, while the rest comes from unrelated design elements such
as the "soft-asic" approach (you guys don't read my 10 page emails, do
you? ;).  I don't deny that some of my ideas could be used in userspace
as well (Credit if used would be appreciated :).

-Greg






Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
>
>
> My 'prohibitively expensive' is true only if you exit every packet.
>
>

Understood, but yet you need to do this if you want something like iSCSI
READ transactions to have as low-latency as possible.

-Greg






[PATCH] fix call near emulation

2009-04-02 Thread Gleb Natapov
The length of the return address pushed onto the stack depends on the
operand size, not the address size.

Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index ca91749..d7c9f6f 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -1792,7 +1792,6 @@ special_insn:
}
c->src.val = (unsigned long) c->eip;
jmp_rel(c, rel);
-   c->op_bytes = c->ad_bytes;
emulate_push(ctxt);
break;
}
--
Gleb.
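
To illustrate the point of the patch, here is a minimal user-space sketch
of a near call in a toy emulator. It is not the x86_emulate.c code:
struct toy_cpu, toy_push() and toy_call_near() are hypothetical names, and
the stack is just a byte array. The only thing it is meant to show is that
the push width comes from op_bytes, while ad_bytes plays no part in it.

```c
#include <assert.h>
#include <stdint.h>

/* Toy CPU state; all names here are made up for illustration. */
struct toy_cpu {
    uint8_t  stack[64];
    uint64_t sp;        /* index into stack[], grows downwards */
    uint64_t ip;        /* address of the instruction after the call */
    unsigned op_bytes;  /* operand size in bytes */
    unsigned ad_bytes;  /* address size in bytes (unused by the push) */
};

/* Push the low 'bytes' bytes of val, least significant byte first. */
static void toy_push(struct toy_cpu *c, uint64_t val, unsigned bytes)
{
    assert(bytes <= 8 && c->sp >= bytes && c->sp <= sizeof(c->stack));
    c->sp -= bytes;
    for (unsigned i = 0; i < bytes; i++)
        c->stack[c->sp + i] = (uint8_t)(val >> (8 * i));
}

/* Near call: jump relative, then push the return address using the
 * operand size, not the address size. */
static void toy_call_near(struct toy_cpu *c, int64_t rel)
{
    uint64_t ret = c->ip;

    c->ip += (uint64_t)rel;
    toy_push(c, ret, c->op_bytes);
}
```

With op_bytes = 2 and ad_bytes = 4, a call moves the stack pointer by two
bytes; overriding op_bytes with ad_bytes first would have moved it by four.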


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Gregory Haskins wrote:
> Rusty Russell wrote:
>   
>
>>  there will *never* be another packet in the queue
>> at this point, since it wasn't running.
>>   
>> 
> Yep, and I'll be the first to admit that my design only looks forward. 
>   
To clarify, I am referring to the internal design of the venet-tap
only.  The general vbus architecture makes no such policy decisions.

-Greg





Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Rusty Russell wrote:
> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>   
>> You do not need to know when the packet is copied (which I currently
>> do).  You only need it for zero-copy (of which I would like to support,
>> but as I understand it there are problems with the reliability of proper
>> callback (i.e. skb->destructor).
>> 
>
> But if you have a UP guest,

I assume you mean UP host ;)

>  there will *never* be another packet in the queue
> at this point, since it wasn't running.
>   
Yep, and I'll be the first to admit that my design only looks forward. 
It's for high-speed links and multi-core CPUs, etc.  If you have a
uniprocessor host, the throughput would likely start to suffer with my
current strategy.  You could probably reclaim some of that throughput
(but trading latency) by doing as you are suggesting with the deferred
initial signalling.  However, it is still a tradeoff to account for the
lower-end rig.  I could certainly put a heuristic/timer on the
guest->host to mitigate this as well, but this is not my target use case
anyway so I am not sure it is worth it.


> As Avi said, you can do the processing in another thread and go back to the
> guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
> again before the thread did for exactly this kind of reason.
>
> While Avi's point about a "powerful enough userspace API" is probably valid,
> I don't think it's going to happen.  It's almost certainly less code to put a
> virtio_net server in the kernel, than it is to create such a powerful
> interface (see vringfd & tap).  And that interface would have one user in
> practice.
>
> So, let's roll out a kernel virtio_net server.  Anyone?
>   
Hmm..well I was hoping to be able to work with you guys to make my
proposal fit this role.  If there is no interest in that, I hope that my
infrastructure itself may still be considered for merging (in *some*
tree, not -kvm per se) as I would prefer to not maintain it out of tree
if it can be avoided.  I think people will find that the new logic
touches very few existing kernel lines at all, and can be completely
disabled with config options so it should be relatively inconsequential
to those that do not care.

-Greg






Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
>> It's more of a "schedule and forget" which I think brings you the
>> win.  The host disables notifications and schedules the actual tx work
>> (rx from the host's perspective).  So now the guest and host continue
>> producing and consuming packets in parallel.  So long as the guest is
>> faster (due to the host being throttled?), notifications continue to
>> be disabled.
>
> Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling.
> When its < 1, we signal roughly once per packet.
>
>> If you changed your rx_isr() to process the packets immediately
>> instead of scheduling, I think throughput would drop dramatically.
>
> Right, that is the point. :) This is that "soft asic" thing I was
> talking about yesterday.

But all that has nothing to do with where the code lives, in the kernel
or userspace.


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
> Avi Kivity wrote:
>> Gregory Haskins wrote:
>>>> virtio is already non-kvm-specific (lguest uses it) and
>>>> non-pci-specific (s390 uses it).
>>>
>>> Ok, then to be more specific, I need it to be more generic than it
>>> already is.  For instance, I need it to be able to integrate with
>>> shm_signals.
>>
>> Why?
>
> Well, shm_signals is what I designed to be the event mechanism for vbus
> devices.  One of the design criteria of shm_signal is that it should
> support a variety of environments, such as kvm, but also something like
> userspace apps.  So I cannot make assumptions about things like "pci
> interrupts", etc.

virtio doesn't make these assumptions either.  The only difference I see
is that you separate notification from the ring structure.

> By your own words, the exit to userspace is "prohibitively expensive",
> so that is either true or its not.  If its 2 microseconds, show me.

In user/test/x86/vmexit.c, change 'cpuid' to 'out %al, $0'; drop the
printf() in kvmctl.c's test_outb().

I get something closer to 4 microseconds, but that's on a two year old
machine;  It will be around two on Nehalems.

My 'prohibitively expensive' is true only if you exit every packet.



--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> Why does a kernel solution not need to know when a packet is
>>> transmitted?
>>>
>>> 
>>
>> You do not need to know when the packet is copied (which I currently
>> do).  You only need it for zero-copy (of which I would like to support,
>> but as I understand it there are problems with the reliability of proper
>> callback (i.e. skb->destructor).
>>
>> Its "fire and forget" :)
>>   
>
> It's more of a "schedule and forget" which I think brings you the
> win.  The host disables notifications and schedules the actual tx work
> (rx from the host's perspective).  So now the guest and host continue
> producing and consuming packets in parallel.  So long as the guest is
> faster (due to the host being throttled?), notifications continue to
> be disabled.
Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling. 
When it's < 1, we signal roughly once per packet.

>
> If you changed your rx_isr() to process the packets immediately
> instead of scheduling, I think throughput would drop dramatically.
Right, that is the point. :) This is that "soft asic" thing I was
talking about yesterday.

-Greg
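
The notification mitigation discussed above (the consumer disables
notifications, schedules the work, and re-enables them only once the ring
is drained) can be sketched in a few lines of C. This is only an
illustration under simplifying assumptions: it is single-threaded, has no
memory barriers, and struct ring, ring_produce() and ring_consume() are
hypothetical names, not code from the venet or virtio patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8

/* Hypothetical ring; a real shared ring needs barriers and atomics. */
struct ring {
    int      slots[RING_SIZE];
    size_t   head;            /* next slot the producer fills */
    size_t   tail;            /* next slot the consumer drains */
    bool     notify_enabled;  /* consumer wants a kick on the next produce */
    unsigned kicks;           /* number of notifications actually sent */
};

static void ring_init(struct ring *r)
{
    assert(r != NULL);
    r->head = r->tail = 0;
    r->notify_enabled = true;
    r->kicks = 0;
}

/* Producer side: queue one entry, kick only if the consumer asked for it. */
static bool ring_produce(struct ring *r, int v)
{
    if (r->head - r->tail == RING_SIZE)
        return false;                 /* ring full */
    r->slots[r->head % RING_SIZE] = v;
    r->head++;
    if (r->notify_enabled) {
        /* Models the consumer disabling notifications when it schedules
         * its work in response to the kick. */
        r->notify_enabled = false;
        r->kicks++;
    }
    return true;
}

/* Consumer side: drain up to 'budget' entries; re-enable notifications
 * only when the ring is empty. */
static size_t ring_consume(struct ring *r, size_t budget)
{
    size_t n = 0;

    while (n < budget && r->tail != r->head) {
        r->tail++;
        n++;
    }
    if (r->tail == r->head)
        r->notify_enabled = true;
    return n;
}
```

As long as the producer keeps the ring non-empty, one kick covers many
packets; once the consumer catches up, the next packet costs a kick again.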






PCI passthrough & Intel 82574L: can't boot from disk

2009-04-02 Thread Hauke Hoffmann
Hi,

qemu-system-x86_64 runs well and I can boot and run the guest system. That
works very well.

Command: 
/usr/local/kvm/bin/qemu-system-x86_64 -m 
512 -hda /var/VM/roadrunner.local/hda.qcow2 -smp 1 -vnc 
192.168.2.30: -net nic,macaddr=DE:AD:BE:EF:90:26 -net 
tap,ifname=tap0,script=no,downscript=no -boot c

Then I tried to add an Intel 82574L network adapter to the guest,
using the same command with the addition of "-pcidevice host=07:00.0".

Then I connected via VNC and saw the BIOS start page and the following lines:
Initializing Intel(r) boot agent ge v1.3.21
pxe 2.1 build 086 (WfM 2.0)
Press f12 for boot menu

You can see a screenshot at http://nxt7.de/download/qemu.png

The guest stays at this point and nothing changes. (I have waited for hours.)

I tried pressing F12 in TightVNC, but nothing happened.
I must say that TightVNC has problems with special chars (in my case).

At this point, I need your help.


Here are some details of my system

Kernel: 2.6.29 from kernel.org (self compiled)
kvm userspace: kvm-84 (self compiled)
OS: Ubuntu 8.04.2 server

r...@ls:~# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
01:09.0 Ethernet controller: Lite-On Communications Inc LNE100TX [Linksys 
EtherFast 10/100] (rev 25)
01:0a.0 VGA compatible controller: XGI Technology Inc. (eXtreme Graphics 
Innovation) Volari Z7
06:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363 AHCI 
Controller (rev 03)
06:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI 
Controller (rev 03)
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network 
Connection


r...@ls:~# lspci -tvvv
-[:00]-+-00.0  nVidia Corporation MCP55 Memory Controller
   +-01.0  nVidia Corporation MCP55 LPC Bridge
   +-01.1  nVidia Corporation MCP55 SMBus
   +-02.0  nVidia Corporation MCP55 USB Controller
   +-02.1  nVidia Corporation MCP55 USB Controller
   +-04.0  nVidia Corporation MCP55 IDE
   +-05.0  nVidia Corporation MCP55 SATA Controller
   +-05.1  nVidia Corporation MCP55 SATA Controller
   +-05.2  nVidia Corporation MCP55 SATA Controller
   +-06.0-[:01]--+-09.0  Lite-On Communications Inc LNE100TX 
[Linksys EtherFast 10/100]
   | \-0a.0  XGI Technology Inc. (eXtreme Graphics 
Innovation) Volari Z7
   +-08.0  nVidia Corporation MCP55 Ethernet
   +-09.0  nVidia Corporation MCP55 Ethernet
   +-0a.0-[:02]--
   +-0b.0-[:03]--
   +-0c.0-[:04]--
   +-0d.0-[:05]--
   +-0e.0-[:06]--+-00.0  JMicron Technologies, Inc. JMicron 
20360/20363 AHCI Controller
   | \-00.1  JMicron Technologies, Inc. JMicron 
20360/20363 AHCI Controller
   +-0f.0-[:07]00.0  Intel Corporation 82574L Gigabit Network 
Connection
   +-18.0  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
   +-18.1  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
Map
   +-18.2  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
Controller
   \-18.3  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control


r...@ls:~# lspci -t
-[:00]-+-00.0
   +-01.0
   +-01.1
   +-02.0
   +-02.1
   +-04.0
   +-05.0
   +-05.1
   +-05.2
   +-06

Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> virtio is already non-kvm-specific (lguest uses it) and
>>> non-pci-specific (s390 uses it).
>>> 
>>
>> Ok, then to be more specific, I need it to be more generic than it
>> already is.  For instance, I need it to be able to integrate with
>> shm_signals.  
>
> Why?
Well, shm_signals is what I designed to be the event mechanism for vbus
devices.  One of the design criteria of shm_signal is that it should
support a variety of environments, such as kvm, but also something like
userspace apps.  So I cannot make assumptions about things like "pci
interrupts", etc.

So if I want to use it in vbus, virtio-ring has to be able to use them,
as opposed to what it does today. Part of this would be a natural fit
for the "kick()" callback in virtio, but there are other problems.  For
one, virtio-ring (IIUC) does its own event-masking directly in the
virtio metadata.  However, really I want the higher layer ring-overlay
to do its masking in terms of the lower-layered shm_signal in order to
work the way I envision this stuff.  If you look at the IOQ
implementation, this is exactly what it does.

To be clear, and I've stated this in the past: venet is just an example
of this generic, in-kernel concept.  We plan on doing much much more
with all this.  One of the things we are working on is having userspace
clients be able to access this too, with an ultimate goal of
supporting things like having guest-userspace doing bypass, rdma, etc.
We are not there yet, though...only the kvm-host to guest kernel is
currently functional and is thus the working example.

I totally "get" the attraction to doing things in userspace.  It's
contained, naturally isolated, easily supports migration, etc.  It's also
a penalty.  Bare-metal userspace apps have a direct path to the kernel
IO.  I want to give guests the same advantage.  Some people will care
more about things like migration than performance, and that is fine.
But others will certainly care more about performance, and that is what
we are trying to address.

>
>  
>
>>> If you have a good exit mitigation scheme you can cut exits by a
>>> factor of 100; so the userspace exit costs are cut by the same
>>> factor.  If you have good copyless networking APIs you can cut the
>>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>>> but a kernel solution needs that too).
>>> 
>>
>> "exit mitigation' schemes are for bandwidth, not latency.  For latency
>> it all comes down to how fast you can signal in both directions.  If
>> someone is going to do a stand-alone request-reply, its generally always
>> going to be at least one hypercall and one rx-interrupt.  So your speed
>> will be governed by your signal path, not your buffer bandwidth.
>>   
>
> The userspace path is longer by 2 microseconds (for two additional
> heavyweight exits) and a few syscalls.  I don't think that's worthy of
> putting all the code in the kernel.

By your own words, the exit to userspace is "prohibitively expensive",
so that is either true or it's not.  If it's 2 microseconds, show me.  We
need the RTT to go from a "kick" PIO all the way to queue a packet
on the egress hardware and return.  That is going to define your
latency.  If you can do this such that you can do something like ICMP
ping in 65us (or anything close to a few dozen microseconds of this),
I'll shut up about how much I think the current path sucks ;)  Even so,
I still propose the concept of a framework for in-kernel devices for
all the other reasons I mentioned above.

-Greg






Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Rusty Russell
On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
> You do not need to know when the packet is copied (which I currently
> do).  You only need it for zero-copy (of which I would like to support,
> but as I understand it there are problems with the reliability of proper
> callback (i.e. skb->destructor).

But if you have a UP guest, there will *never* be another packet in the queue
at this point, since it wasn't running.

As Avi said, you can do the processing in another thread and go back to the
guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
again before the thread did for exactly this kind of reason.

While Avi's point about a "powerful enough userspace API" is probably valid,
I don't think it's going to happen.  It's almost certainly less code to put a
virtio_net server in the kernel, than it is to create such a powerful
interface (see vringfd & tap).  And that interface would have one user in
practice.

So, let's roll out a kernel virtio_net server.  Anyone?
Rusty.


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
>> Why does a kernel solution not need to know when a packet is transmitted?
>
> You do not need to know when the packet is copied (which I currently
> do).  You only need it for zero-copy (of which I would like to support,
> but as I understand it there are problems with the reliability of proper
> callback (i.e. skb->destructor).
>
> Its "fire and forget" :)

It's more of a "schedule and forget" which I think brings you the win.
The host disables notifications and schedules the actual tx work (rx
from the host's perspective).  So now the guest and host continue
producing and consuming packets in parallel.  So long as the guest is
faster (due to the host being throttled?), notifications continue to be
disabled.

If you changed your rx_isr() to process the packets immediately instead
of scheduling, I think throughput would drop dramatically.

Mark had a similar change for virtio.  Mark?

--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>
> Now you are making my point ;)  This is part of the cost of your
> signaling path, and it directly adds to your latency time.

It adds a microsecond.  The kvm overhead of putting things in userspace
is low enough, I don't know why people keep mentioning it.  The problem
is the kernel/user networking interfaces.

> You can't
> buffer packets here if the guest is only going to send one and wait for
> a response and expect that to perform well.  And this is precisely what
> drove me to look at avoiding going back to userspace in the first place.

We're not buffering any packets.  What we lack is a way to tell the
guest that we're done processing all packets in the ring (IOW, re-enable
notifications).


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Avi Kivity

Gregory Haskins wrote:
>> virtio is already non-kvm-specific (lguest uses it) and
>> non-pci-specific (s390 uses it).
>
> Ok, then to be more specific, I need it to be more generic than it
> already is.  For instance, I need it to be able to integrate with
> shm_signals.

Why?

>> If you have a good exit mitigation scheme you can cut exits by a
>> factor of 100; so the userspace exit costs are cut by the same
>> factor.  If you have good copyless networking APIs you can cut the
>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>> but a kernel solution needs that too).
>
> "exit mitigation' schemes are for bandwidth, not latency.  For latency
> it all comes down to how fast you can signal in both directions.  If
> someone is going to do a stand-alone request-reply, its generally always
> going to be at least one hypercall and one rx-interrupt.  So your speed
> will be governed by your signal path, not your buffer bandwidth.

The userspace path is longer by 2 microseconds (for two additional
heavyweight exits) and a few syscalls.  I don't think that's worthy of
putting all the code in the kernel.
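
The two positions in this exchange can be put into numbers with a small
worked example. The figures are the ones quoted in the thread (about 2
microseconds of extra exit cost, a mitigation factor of 100); the helper
amortized_exit_ns() is a hypothetical name used only for this arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Extra exit cost attributed to each packet when one exit covers
 * 'packets_per_exit' packets. Integer nanoseconds keep it exact. */
static uint64_t amortized_exit_ns(uint64_t exit_cost_ns,
                                  uint64_t packets_per_exit)
{
    assert(packets_per_exit > 0);
    return exit_cost_ns / packets_per_exit;
}
```

With amortized_exit_ns(2000, 100) the streaming case pays about 20 ns per
packet, which is the bandwidth argument. A lone request-reply is the
amortized_exit_ns(2000, 1) case and pays the full 2000 ns, which is the
latency argument.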


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 09/17] net: Add vbus_enet driver

2009-04-02 Thread Gregory Haskins
Stephen Hemminger wrote:
> On Tue, 31 Mar 2009 14:43:34 -0400
> Gregory Haskins  wrote:
>
>> +struct vbus_enet_priv {
>> +	spinlock_t lock;
>> +	struct net_device *dev;
>> +	struct vbus_device_proxy *vdev;
>> +	struct napi_struct napi;
>> +	struct net_device_stats stats;
>> 
>
> Not needed any more, stats are available in net_device
>
>   

Thanks for the review, Stephen!

I will apply all of your recommended fixes for the next release.

-Greg





virtio_net: MAC address related breakage if there is no MAC area in config

2009-04-02 Thread Christian Borntraeger
Hello,

commit 9c46f6d42f1b5627c49a5906cb5b315ad8716ff0
Author: Alex Williamson 
Date:   Wed Feb 4 16:36:34 2009 -0800
virtio_net: Allow setting the MAC address of the NIC

Introduced an unconditional config->set to the MAC address field of
the device config.


+   } else {
random_ether_addr(dev->dev_addr);
+   vdev->config->set(vdev, offsetof(struct virtio_net_config, mac),
+ dev->dev_addr, dev->addr_len);
+   }

Since our kuli userspace sample does not set VIRTIO_NET_F_MAC, there is no
config space assigned for this device. When virtio_net tries to overwrite
the non-existing field, this triggers a bug.

virtio_net.h specifies:
struct virtio_net_config
{
/* The config defining mac address (if VIRTIO_NET_F_MAC) */
__u8 mac[6];
[...]

I read this as the mac config field is optional (similar to all the optional
fields we added in virtio_blk later).

I see two options:
1. Change our sample userspace to always allocate the config (like lguest and
qemu)
2. Change the kernel code to not write into the config unless a specific feature
bit is set. (e.g. VIRTIO_NET_F_SETMAC)


Opinions?

PS: The same is true for virtnet_set_mac_address; it crashes as well.
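
Option 2 above could look roughly like the following user-space sketch. It
is not the virtio_net driver code: struct demo_dev, DEMO_F_MAC,
demo_config_set() and demo_publish_mac() are hypothetical stand-ins, and
the feature bit value is arbitrary. The idea is simply that the write into
the config area is skipped unless the device advertised the feature, and
that the setter refuses to write outside the config space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_F_MAC   (1u << 0)  /* hypothetical bit, not the real virtio value */
#define DEMO_MAC_LEN 6

/* Hypothetical device: 'config' may be NULL when no config space exists. */
struct demo_dev {
    uint32_t features;
    uint8_t *config;
    size_t   config_len;
};

/* Bounds-checked config write; fails instead of touching missing space. */
static bool demo_config_set(struct demo_dev *d, size_t off,
                            const void *buf, size_t len)
{
    if (d->config == NULL || off > d->config_len ||
        len > d->config_len - off)
        return false;
    memcpy(d->config + off, buf, len);
    return true;
}

/* Write the MAC into the config area only if the feature was negotiated. */
static bool demo_publish_mac(struct demo_dev *d,
                             const uint8_t mac[DEMO_MAC_LEN])
{
    assert(d != NULL && mac != NULL);
    if (!(d->features & DEMO_F_MAC))
        return false;           /* no MAC field: leave config alone */
    return demo_config_set(d, 0, mac, DEMO_MAC_LEN);
}
```

A device without the feature bit and without config space then gets a
plain "not written" result rather than a write to a non-existing field.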


Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-02 Thread Izik Eidus

Anthony Liguori wrote:
> Chris Wright wrote:
>> * Anthony Liguori (anth...@codemonkey.ws) wrote:
>>> The ioctl() interface is quite bad for what you're doing.  You're
>>> telling the kernel extra information about a VA range in
>>> userspace.  That's what madvise is for.  You're tweaking simple
>>> read/write values of kernel infrastructure.  That's what sysfs is for.
>>
>> I agree re: sysfs (brought it up myself before).  As far as madvise vs.
>> ioctl, the one thing that comes from the ioctl is fops->release to
>> automagically unregister memory on exit.
>
> This is precisely why ioctl() is a bad interface.  fops->release isn't
> tied to the process but rather tied to the open file.  The file can
> stay open long after the process exits either by a fork()'d child
> inheriting the file descriptor or through something more sinister like
> SCM_RIGHTS.
>
> In fact, a common mistake is to leak file descriptors by not closing
> them when exec()'ing a process.  Instead of just delaying a close, if
> you rely on this behavior to unregister memory regions, you could
> potentially have badness happen in the kernel if ksm attempted to
> access an invalid memory region.

How could such badness ever happen in the kernel?
KSM works on virtual addresses! It fetches the pages using
get_user_pages(), the mm struct is protected by get_task_mm(), and in
addition we take down_read(mmap_sem).

So how could KSM ever access an invalid memory region, unless the host
page table or get_task_mm() stopped working?

When someone registers memory for scanning, we do get_task_mm(). The
memory is unregistered when the file is closed, or, when the caller no
longer wants it registered, by calling the unregister ioctl.

You can argue about the API, but saying KSM is insecure is a
mathematical claim; please show me a scenario!
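
The behaviour Anthony describes, an open file outliving a close() in the
process that created it because a fork()'d child still holds the
descriptor, can be seen from plain userspace. The sketch below uses pipes
rather than the ksm device, so it says nothing about ksm itself;
fd_outlives_parent_close() is a hypothetical name and error handling is
kept minimal.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if data arrives on a pipe whose write end the parent has
 * already closed, because a forked child still holds that write end. */
static int fd_outlives_parent_close(void)
{
    int data[2], go[2];
    char c = 0, token = 'g';
    ssize_t n;
    pid_t pid;

    if (pipe(data) != 0 || pipe(go) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: wait until the parent has closed its copy, then write
         * through the inherited descriptor. */
        close(data[0]);
        close(go[1]);
        if (read(go[0], &token, 1) == 1 && write(data[1], "x", 1) == 1)
            _exit(0);
        _exit(1);
    }

    close(go[0]);
    close(data[1]);                    /* the parent "closes the file" */
    if (write(go[1], &token, 1) != 1)  /* tell the child to go ahead */
        return -1;
    n = read(data[0], &c, 1);          /* still gets data, not EOF */
    close(data[0]);
    close(go[1]);
    waitpid(pid, NULL, 0);
    return (n == 1 && c == 'x') ? 1 : 0;
}
```

If the parent's close() had released the file, the read would have
returned end-of-file; instead the byte written by the child arrives.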



Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Herbert Xu wrote:
>> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>>  
>>> If tap told us when the packets were actually transmitted, life
>>> would be  wonderful:
>>> 
>>
>> And why do we need this? Because we are in user space!
>>
>>   
>
> Why does a kernel solution not need to know when a packet is transmitted?
>

You do not need to know when the packet is copied (which I currently
do).  You only need it for zero-copy (which I would like to support,
but as I understand it there are problems with the reliability of the
proper callback, i.e. skb->destructor).

It's "fire and forget" :)

-Greg





Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Herbert Xu wrote:
>> Avi Kivity  wrote:
>>  
>>> virtio is already non-kvm-specific (lguest uses it) and
>>> non-pci-specific (s390 uses it).
>>> 
>>
>> I think Greg's work shows that putting the backend in the kernel
>> can dramatically reduce the cost of a single guest->host transaction.
>> I'm sure the same thing would work for virtio too.
>>   
>
> Virtio suffers because we've had no notification of when a packet is
> actually submitted.  With the notification, the only difference should
> be in the cost of a kernel->user switch, which is nowhere nearly as
> dramatic.
>
>>> If you have a good exit mitigation scheme you can cut exits by a
>>> factor of 100; so the userspace exit costs are cut by the same
>>> factor.  If you have good copyless networking APIs you can cut the
>>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>>> but a kernel solution needs that too).
>>> 
>>
>> Given the choice of having to mitigate or not having the problem
>> in the first place, guess what I would prefer :)
>>   
>
> There is no choice.  Exiting from the guest to the kernel to userspace
> is prohibitively expensive, you can't do that on every packet.
>

Now you are making my point ;)  This is part of the cost of your
signaling path, and it directly adds to your latency time.   You can't
buffer packets here if the guest is only going to send one and wait for
a response and expect that to perform well.  And this is precisely what
drove me to look at avoiding going back to userspace in the first place.

-Greg





Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>
>>
>> I think there is a slight disconnect here.  This is *exactly* what I am
>> trying to do.  You can of course do this many ways, and I am not denying
>> it could be done a different way than the path I have chosen.  One
>> extreme would be to just slam a virtio-net specific chunk of code
>> directly into kvm on the host.  Another extreme would be to build a
>> generic framework into Linux for declaring arbitrary IO types,
>> integrating it with kvm (as well as other environments such as lguest,
>> userspace, etc), and building a virtio-net model on top of that.
>>
>> So in case it is not obvious at this point, I have gone with the latter
>> approach.  I wanted to make sure it wasn't kvm specific or something
>> like pci specific so it had the broadest applicability to a range of
>> environments.  So that is why the design is the way it is.  I understand
>> that this approach is technically "harder/more-complex" than the "slam
>> virtio-net into kvm" approach, but I've already done that work.  All we
>> need to do now is agree on the details ;)
>>
>>   
>
> virtio is already non-kvm-specific (lguest uses it) and
> non-pci-specific (s390 uses it).

Ok, then to be more specific, I need it to be more generic than it
already is.  For instance, I need it to be able to integrate with
shm_signals.  If we can do that without breaking the existing ABI, that
would be great!  Last I looked, it was somewhat entwined here so I didn't
try...but I admit that I didn't try that hard since I already had the IOQ
library ready to go.

>
>>> That said, I don't think we're bound today by the fact that we're in
>>> userspace.
>>> 
>> You will *always* be bound by the fact that you are in userspace.  Its
>> purely a question of "how much" and "does anyone care".Right now,
>> the anwer is "a lot (roughly 45x slower)" and "at least Greg's customers
>> do".  I have no doubt that this can and will change/improve in the
>> future.  But it will always be true that no matter how much userspace
>> improves, the kernel based solution will always be faster.  Its simple
>> physics.  I'm cutting out the middleman to ultimately reach the same
>> destination as the userspace path, so userspace can never be equal.
>>   
>
> If you have a good exit mitigation scheme you can cut exits by a
> factor of 100; so the userspace exit costs are cut by the same
> factor.  If you have good copyless networking APIs you can cut the
> cost of copies to zero (well, to the cost of get_user_pages_fast(),
> but a kernel solution needs that too).

"exit mitigation' schemes are for bandwidth, not latency.  For latency
it all comes down to how fast you can signal in both directions.  If
someone is going to do a stand-alone request-reply, it's generally always
going to be at least one hypercall and one rx-interrupt.  So your speed
will be governed by your signal path, not your buffer bandwidth.

What I've done is shown that you can use techniques other than buffering
the head of the queue to do exit mitigation for bandwidth, while still
maintaining a very short signaling path for latency.  And I also argue
that the latter will always be optimal in the kernel, though I know the
degree is still TBD.  Anthony thinks he can make the difference
negligible, and I would love to see it but am skeptical.

-Greg







Qemu process in Guest

2009-04-02 Thread Kumar, Venkat
1. How does Qemu process start running in Guest?
2. How does a guest's I/O request get trapped into the user mode qemu process?

Thx,

Venkat




Re: [PATCH] kvm : qemu : fix compilation error in kvm-userspace for ia64

2009-04-02 Thread Avi Kivity

Zhang, Yang wrote:
> when using make in kernel, it can not find msidef.h. This patch
> fix this.

Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: Qemu: fix compilation error in IA64

2009-04-02 Thread Avi Kivity

Zhang, Yang wrote:
> fix compilation error in IA64

Applied, thanks.

--
error compiling committee.c: too many arguments to function


