Re: [PATCH RFC 2/5] vringfd base/offset

2008-04-07 Thread Arnd Bergmann
On Saturday 05 April 2008, Rusty Russell wrote:
>  static const struct file_operations vring_fops = {
> .release= vring_release,
> .write  = vring_write,
> .poll   = vring_poll,
> +   .ioctl  = vring_ioctl,
>  };

This should also set the .compat_ioctl pointer. If the argument 
is defined as an actual memory location instead of a pointer,
you also don't need the compat_ptr() conversion for s390x, so
both can point to the same vring_ioctl function.

Arnd <><
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 1/5] vringfd syscall

2008-04-07 Thread Arnd Bergmann
On Saturday 05 April 2008, Rusty Russell wrote:
> +asmlinkage long sys_vringfd(void __user *addr,
> +       unsigned num_descs,
> +       u16 __user *last_used)
> +{
> +   int fd, err;
> +   struct file *filp;
> +   struct vring_info *vr;
> +
> +   /* Must be a power of two, and representable by u16 */
> +   if (!num_descs || (num_descs & (num_descs-1)) || num_descs > 65536) {
> +   err = -EINVAL;
> +   goto out;
> +   }
> +
> +   fd = get_unused_fd();
> +   if (fd < 0) {
> +   err = fd;
> +   goto out;
> +   }
> +
> +   filp = alloc_file(vring_mnt, dget(vring_mnt->mnt_root), FMODE_WRITE,
> +     &vring_fops);
> +   if (!filp) {
> +   err = -ENFILE;
> +   goto put_fd;
> +   }

This looks like a candidate for anon_inode_getfd(), which would let you
get rid of the code for registering your own file system.

Arnd <><
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Call for Presentations: KVM Forum 2008

2008-04-07 Thread Rusty Russell
On Tuesday 08 April 2008 10:17:26 Avi Kivity wrote:
> [Note: KVM Forum registration is now open at
> http://kforum.qumranet.com/KVMForum/about_kvmforum.php]
>
> This is the Call for Presentations for the second annual KVM Developer's
> Forum, to be held on June 10-13, 2008, in Napa, California, USA [1].

Nak, unf.  I've been told pretty clearly that I'm only doing one trip this 
year, and I'm on it.

I'm sure Anthony to channel my opinions with no effort :)

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 17:35:28 David Miller wrote:
> From: Rusty Russell <[EMAIL PROTECTED]>
> Date: Mon, 7 Apr 2008 17:24:51 +1000
>
> > On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> > > On second thought, this is not going to work.  The network stack
> > > can clone individual pages out of this skb and put it into a new
> > > skb.  Therefore whatever scheme we come up with will either need
> > > to be page-based, or add a flag to tell the network stack that it
> > > can't clone those pages.
> >
> > Erk... I'll put in the latter for now.   A page-level solution is not
> > really an option: if userspace hands us mmaped pages for example.
>
> Keep in mind that the core of the TCP stack really depends
> upon being able to slice and dice paged SKBs as is pleases
> in order to send packets out.
>
> In fact, it also does such splitting during SACK processing.
>
> It really is a base requirement for efficient TSO support.
> Otherwise the above operations would be so incredibly
> expensive we might as well rip all of the TSO support out.

In this case that's OK; these are only for packets coming from 
userspace/guest, so the only interaction with the tcp code is possibly as a 
recipient.

If tcp wanted to do zero-copy aio xmit, I think we'd need something else.

Cheers,
Rusty.




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 1/5] vringfd syscall

2008-04-07 Thread Rusty Russell
On Tuesday 08 April 2008 03:54:34 Jonathan Corbet wrote:
> Hey, Rusty,
>
> > For virtualization, we've developed virtio_ring for efficient
> > communication. This would also work well for userspace-kernel
> > communication, particularly for things like the tun device.  By using the
> > same ABI, we can join guests to the host kernel trivially.
>
> I'm *sure* you meant to document that somewhat non-trivial proposed new
> kernel API as soon as you got a moment.

Actually, yes.  But I wanted to get it out there before I start the treck 
across to the virtualization summit.

A few points:
'The page alignment for the used array is important - that array might be 
mapped separately into kernel space.'
   Well, the used array is written by one side only, so it's possible to split 
the ring here and make each part r/o to the other side.  More importantly, a 
page boundary is almost certainly a cacheline boundary, and we already have a 
userspace interface for it.

'Note that the flags fields in the vring_avail and vring_used structures 
appear to be unused.'
   virtio uses these for wakeup/interrupt suppression.  It's a cheap way to 
avoid hypercalls, and we can use them the same way to avoid system calls (you 
set the suppression bit while you're actually looking at the ring).

The need for the kmap (and hence the atomic horror) has now been alleviated: I 
changed the shinfo destructor code to allow the destructor to hold onto the 
skb data so it can queue it and free it later.

BTW, the only place currently where both output and input buffers are used is 
the virtio_blk driver doing a read, where the header describes the operation, 
and the other buffers are overwritten with the data.

Thanks!
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Call for Presentations: KVM Forum 2008

2008-04-07 Thread Avi Kivity
[Note: KVM Forum registration is now open at 
http://kforum.qumranet.com/KVMForum/about_kvmforum.php]

This is the Call for Presentations for the second annual KVM Developer's
Forum, to be held on June 10-13, 2008, in Napa, California, USA [1].  We
are looking for presentations on KVM development, quality assurance,
management, security, interoperability, architecture support, and 
interesting use cases.  Presentations are 50 minutes in length; there 
are also 25-minute mini-presentation slots available.

KVM Forum presentations are an excellent way to inform the KVM 
development community about your work, and to gather valuable feedback 
about your approach.

Please send your presentation proposal to the KVM Forum 2008 Content
Committee at [EMAIL PROTECTED] by April 20th.

KVM Forum 2008 Content Committee:
 Dor Laor
 Anthony Liguori
 Avi Kivity

[1] http://kforum.qumranet.com/KVMForum/about_kvmforum.php

-- 
Any sufficiently difficult bug is indistinguishable from a feature.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] virtio_net: remove overzealous printk

2008-04-07 Thread Anthony Liguori
The 'disable_cb' is really just a hint and as such, it's possible for more
work to get queued up while callbacks are disabled.  Under stress with an
SMP guest, this printk triggers very frequently.  There is no race here, this
is how things are designed to work so let's just remove the printk.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>
Acked-by: Rusty Russell <[EMAIL PROTECTED]>

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b58472c..d1a200f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -284,7 +284,6 @@ again:
/* Activate callback for using skbs: if this returns false it
 * means some were used in the meantime. */
if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
-   printk("Unlikely: restart svq race\n");
vi->svq->vq_ops->disable_cb(vi->svq);
netif_start_queue(dev);
goto again;
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: remove overzealous BUG_ON.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 23:50:24 Anthony Liguori wrote:
> Rusty Russell wrote:
> > The 'disable_cb' callback is designed as an optimization to tell the host
> > we don't need callbacks now.  As it is not reliable, the debug check is
> > overzealous: it can happen on two CPUs at the same time.  Document this.
> >
> > Even if it were reliable, the virtio_net driver doesn't disable
> > callbacks on transmit so the START_USE/END_USE debugging reentrance
> > protection can be easily tripped even on UP.
> >
> > Thanks to Balaji Rao for the bug report and testing.
>
> I think the printk() in start_xmit should be removed too (or at least
> rate limited).

Yep, please send patch.  I'm about to jump on a plane, so consider this a 
pre-canned ack.

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-04-07 Thread Yinghai Lu
On Sun, Apr 6, 2008 at 9:38 AM, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> Yinghai Lu wrote:
>
> > so will cost every bzImage extra memory copy? that could be 18M or
> > even more big.
> >
>
>  I wouldn't worry about that.  You will typically have several copies of the
> images during the execution of the boot loader.

i put all drivers needed in kernel.
1. bootloader copy bzImage (6M) to memory
2. arch/x86/boot/compressed/head_32.S, will copy bzImage to end of
buffer to do uncompress on possiton.
3. parse_elf will copy the vmlinux (the uncompressed, that is some big, 18M)

I suggest that could have special elf header, and will only have one
PT_LOAD, and avoid the copy, and just offset start address of
uncompressed kernel for jump later.

YH
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-04-07 Thread Yinghai Lu
On Sun, Apr 6, 2008 at 10:25 AM, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
>
> Yinghai Lu wrote:
>
> > On Sun, Apr 6, 2008 at 9:38 AM, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> >
> > > Yinghai Lu wrote:
> > >
> > >
> > > > so will cost every bzImage extra memory copy? that could be 18M or
> > > > even more big.
> > > >
> > > >
> > >  I wouldn't worry about that.  You will typically have several copies of
> the
> > > images during the execution of the boot loader.
> > >
> >
> > i put all drivers needed in kernel.
> > 1. bootloader copy bzImage (6M) to memory
> > 2. arch/x86/boot/compressed/head_32.S, will copy bzImage to end of
> > buffer to do uncompress on possiton.
> > 3. parse_elf will copy the vmlinux (the uncompressed, that is some big,
> 18M)
> >
> > I suggest that could have special elf header, and will only have one
> > PT_LOAD, and avoid the copy, and just offset start address of
> > uncompressed kernel for jump later.
> >
>
>  Once again, I think you will have a hard time measuring the time
> difference.

ok.

forget about my proposal..

YH
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-04-07 Thread Yinghai Lu
On Wed, Feb 13, 2008 at 1:54 PM, Ian Campbell <[EMAIL PROTECTED]> wrote:
> This allows other boot loaders such as the Xen domain builder the
>  opportunity to extract the ELF file.
>
>  Signed-off-by: Ian Campbell <[EMAIL PROTECTED]>
>  Cc: Thomas Gleixner <[EMAIL PROTECTED]>
>  Cc: Ingo Molnar <[EMAIL PROTECTED]>
>  Cc: H. Peter Anvin <[EMAIL PROTECTED]>
>  Cc: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
>  Cc: virtualization@lists.linux-foundation.org
>  ---
>   Documentation/i386/boot.txt   |   18 
>   arch/x86/boot/Makefile|   14 +
>   arch/x86/boot/compressed/Makefile |2 +-
>   arch/x86/boot/compressed/misc.c   |   56 
> +
>   arch/x86/boot/header.S|6 
>   5 files changed, 95 insertions(+), 1 deletions(-)
>
>  diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
>  index fc49b79..b5f5ba1 100644
>  --- a/Documentation/i386/boot.txt
>  +++ b/Documentation/i386/boot.txt
>  @@ -170,6 +170,8 @@ Offset  Proto   NameMeaning
>   0238/4 2.06+   cmdline_sizeMaximum size of the kernel command line
>   023C/4 2.07+   hardware_subarch Hardware subarchitecture
>   0240/8 2.07+   hardware_subarch_data Subarchitecture-specific data
>  +0248/4 2.08+   compressed_payload_offset
>  +024C/4 2.08+   compressed_payload_length
>
>   (1) For backwards compatibility, if the setup_sects field contains 0, the
>  real value is 4.
>  @@ -512,6 +514,22 @@ Protocol:  2.07+
>
>A pointer to data that is specific to hardware subarch
>
>  +Field name:compressed_payload_offset
>  +Type:  read
>  +Offset/size:   0x248/4
>  +Protocol:  2.08+
>  +
>  +  If non-zero then this field contains the offset from the end of the
>  +  real-mode code to the compressed payload. The compression format
>  +  should be determined using the standard magic number, currently only
>  +  gzip is used.
>  +
>  +Field name:compressed_payload_length
>  +Type:  read
>  +Offset/size:   0x24c/4
>  +Protocol:  2.08+
>  +
>  +  The length of the compressed payload.
>
>    THE KERNEL COMMAND LINE
>
>  diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
>  index f88458e..9695aff 100644
>  --- a/arch/x86/boot/Makefile
>  +++ b/arch/x86/boot/Makefile
>  @@ -94,6 +94,20 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
>
>   SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
>
>  +sed-offsets := -e 's/^00*/0/' \
>  +-e 's/^\([0-9a-fA-F]*\) . \(input_data\|input_data_end\)$$/\#define 
> \2 0x\1/p'
>  +
>  +quiet_cmd_offsets = OFFSETS $@
>  +  cmd_offsets = $(NM) $< | sed -n $(sed-offsets) > $@
>  +
>  +$(obj)/offsets.h: $(obj)/compressed/vmlinux FORCE
>  +   $(call if_changed,offsets)
>  +
>  +targets += offsets.h
>  +
>  +AFLAGS_header.o += -I$(obj)
>  +$(obj)/header.o: $(obj)/offsets.h
>  +
>   LDFLAGS_setup.elf  := -T
>   $(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
> $(call if_changed,ld)
>  diff --git a/arch/x86/boot/compressed/Makefile 
> b/arch/x86/boot/compressed/Makefile
>  index d2b9f3b..92fdd35 100644
>  --- a/arch/x86/boot/compressed/Makefile
>  +++ b/arch/x86/boot/compressed/Makefile
>  @@ -22,7 +22,7 @@ $(obj)/vmlinux: $(src)/vmlinux_$(BITS).lds 
> $(obj)/head_$(BITS).o $(obj)/misc.o $
> $(call if_changed,ld)
> @:
>
>  -OBJCOPYFLAGS_vmlinux.bin := -O binary -R .note -R .comment -S
>  +OBJCOPYFLAGS_vmlinux.bin :=  -R .comment -S
>   $(obj)/vmlinux.bin: vmlinux FORCE
> $(call if_changed,objcopy)
>
>  diff --git a/arch/x86/boot/compressed/misc.c 
> b/arch/x86/boot/compressed/misc.c
>  index 8182e32..69aec2f 100644
>  --- a/arch/x86/boot/compressed/misc.c
>  +++ b/arch/x86/boot/compressed/misc.c
>  @@ -15,6 +15,10 @@
>   * we just keep it from happening
>   */
>   #undef CONFIG_PARAVIRT
>  +#ifdef CONFIG_X86_32
>  +#define _ASM_DESC_H_ 1
>  +#endif
>  +
>   #ifdef CONFIG_X86_64
>   #define _LINUX_STRING_H_ 1
>   #define __LINUX_BITMAP_H 1
>  @@ -22,6 +26,7 @@
>
>   #include 
>   #include 
>  +#include 
>   #include 
>   #include 
>   #include 
>  @@ -365,6 +370,56 @@ static void error(char *x)
> asm("hlt");
>   }
>
>  +static void parse_elf(void *output)
>  +{
>  +#ifdef CONFIG_X86_64
>  +   Elf64_Ehdr ehdr;
>  +   Elf64_Phdr *phdrs, *phdr;
>  +#else
>  +   Elf32_Ehdr ehdr;
>  +   Elf32_Phdr *phdrs, *phdr;
>  +#endif
>  +   void *dest;
>  +   int i;
>  +
>  +   memcpy(&ehdr, output, sizeof(ehdr));
>  +   if(ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
>  +  ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
>  +  ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
>  +  ehdr.e_ident[EI_MAG3] != ELFMAG3)
>  +   {
>  +   error("Kernel is not a valid ELF file");
>  +   return;
>  +   }
>  +
>  +   putstr("Parsing ELF... ");
>  +
>  +   phdrs = malloc(sizeof(*phdrs) * ehdr.e_phnum);
>  +   if (!phdrs)
>  +   error("Failed to allocate space f

Re: [PATCH] virtio: remove overzealous BUG_ON.

2008-04-07 Thread Anthony Liguori
Rusty Russell wrote:
> The 'disable_cb' callback is designed as an optimization to tell the host
> we don't need callbacks now.  As it is not reliable, the debug check is
> overzealous: it can happen on two CPUs at the same time.  Document this.
> 
> Even if it were reliable, the virtio_net driver doesn't disable
> callbacks on transmit so the START_USE/END_USE debugging reentrance
> protection can be easily tripped even on UP.
> 
> Thanks to Balaji Rao for the bug report and testing.

I think the printk() in start_xmit should be removed too (or at least 
rate limited).

Regards,

Anthony Liguori

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread David Miller
From: Rusty Russell <[EMAIL PROTECTED]>
Date: Mon, 7 Apr 2008 17:24:51 +1000

> On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> > On second thought, this is not going to work.  The network stack
> > can clone individual pages out of this skb and put it into a new
> > skb.  Therefore whatever scheme we come up with will either need
> > to be page-based, or add a flag to tell the network stack that it
> > can't clone those pages.
> 
> Erk... I'll put in the latter for now.   A page-level solution is not really 
> an option: if userspace hands us mmaped pages for example.

Keep in mind that the core of the TCP stack really depends
upon being able to slice and dice paged SKBs as is pleases
in order to send packets out.

In fact, it also does such splitting during SACK processing.

It really is a base requirement for efficient TSO support.
Otherwise the above operations would be so incredibly
expensive we might as well rip all of the TSO support out.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> Rusty Russell <[EMAIL PROTECTED]> wrote:
> > +/* We are done with this skb: put it in the used pile. */
> > +static void skb_finished(struct skb_shared_info *sinfo)
> > +{
> > +   struct skb_shinfo_tun *sht = (void *)(sinfo + 1);
> > +
> > +   /* FIXME: Race prevention */
> > +   vring_used_buffer_atomic(sht->tun->outring, sht->id, sht->len);
> > +   vring_wake(sht->tun->outring);
> > +
> > +   /* Release device. */
> > +   dev_put(sht->tun->dev);
> > +}
>
> On second thought, this is not going to work.  The network stack
> can clone individual pages out of this skb and put it into a new
> skb.  Therefore whatever scheme we come up with will either need
> to be page-based, or add a flag to tell the network stack that it
> can't clone those pages.

Erk... I'll put in the latter for now.   A page-level solution is not really 
an option: if userspace hands us mmaped pages for example.

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization