Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
 Rusty Russell [EMAIL PROTECTED] wrote:
  +/* We are done with this skb: put it in the used pile. */
  +static void skb_finished(struct skb_shared_info *sinfo)
  +{
  +   struct skb_shinfo_tun *sht = (void *)(sinfo + 1);
  +
  +   /* FIXME: Race prevention */
  +   vring_used_buffer_atomic(sht-tun-outring, sht-id, sht-len);
  +   vring_wake(sht-tun-outring);
  +
  +   /* Release device. */
  +   dev_put(sht-tun-dev);
  +}

 On second thought, this is not going to work.  The network stack
 can clone individual pages out of this skb and put it into a new
 skb.  Therefore whatever scheme we come up with will either need
 to be page-based, or add a flag to tell the network stack that it
 can't clone those pages.

Erk... I'll put in the latter for now.   A page-level solution is not really 
an option: if userspace hands us mmaped pages for example.

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread David Miller
From: Rusty Russell [EMAIL PROTECTED]
Date: Mon, 7 Apr 2008 17:24:51 +1000

 On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
  On second thought, this is not going to work.  The network stack
  can clone individual pages out of this skb and put it into a new
  skb.  Therefore whatever scheme we come up with will either need
  to be page-based, or add a flag to tell the network stack that it
  can't clone those pages.
 
 Erk... I'll put in the latter for now.   A page-level solution is not really 
 an option: if userspace hands us mmaped pages for example.

Keep in mind that the core of the TCP stack really depends
upon being able to slice and dice paged SKBs as is pleases
in order to send packets out.

In fact, it also does such splitting during SACK processing.

It really is a base requirement for efficient TSO support.
Otherwise the above operations would be so incredibly
expensive we might as well rip all of the TSO support out.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-04-07 Thread Yinghai Lu
On Wed, Feb 13, 2008 at 1:54 PM, Ian Campbell [EMAIL PROTECTED] wrote:
 This allows other boot loaders such as the Xen domain builder the
  opportunity to extract the ELF file.

  Signed-off-by: Ian Campbell [EMAIL PROTECTED]
  Cc: Thomas Gleixner [EMAIL PROTECTED]
  Cc: Ingo Molnar [EMAIL PROTECTED]
  Cc: H. Peter Anvin [EMAIL PROTECTED]
  Cc: Jeremy Fitzhardinge [EMAIL PROTECTED]
  Cc: virtualization@lists.linux-foundation.org
  ---
   Documentation/i386/boot.txt   |   18 
   arch/x86/boot/Makefile|   14 +
   arch/x86/boot/compressed/Makefile |2 +-
   arch/x86/boot/compressed/misc.c   |   56 
 +
   arch/x86/boot/header.S|6 
   5 files changed, 95 insertions(+), 1 deletions(-)

  diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
  index fc49b79..b5f5ba1 100644
  --- a/Documentation/i386/boot.txt
  +++ b/Documentation/i386/boot.txt
  @@ -170,6 +170,8 @@ Offset  Proto   NameMeaning
   0238/4 2.06+   cmdline_sizeMaximum size of the kernel command line
   023C/4 2.07+   hardware_subarch Hardware subarchitecture
   0240/8 2.07+   hardware_subarch_data Subarchitecture-specific data
  +0248/4 2.08+   compressed_payload_offset
  +024C/4 2.08+   compressed_payload_length

   (1) For backwards compatibility, if the setup_sects field contains 0, the
  real value is 4.
  @@ -512,6 +514,22 @@ Protocol:  2.07+

A pointer to data that is specific to hardware subarch

  +Field name:compressed_payload_offset
  +Type:  read
  +Offset/size:   0x248/4
  +Protocol:  2.08+
  +
  +  If non-zero then this field contains the offset from the end of the
  +  real-mode code to the compressed payload. The compression format
  +  should be determined using the standard magic number, currently only
  +  gzip is used.
  +
  +Field name:compressed_payload_length
  +Type:  read
  +Offset/size:   0x24c/4
  +Protocol:  2.08+
  +
  +  The length of the compressed payload.

    THE KERNEL COMMAND LINE

  diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
  index f88458e..9695aff 100644
  --- a/arch/x86/boot/Makefile
  +++ b/arch/x86/boot/Makefile
  @@ -94,6 +94,20 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE

   SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))

  +sed-offsets := -e 's/^00*/0/' \
  +-e 's/^\([0-9a-fA-F]*\) . \(input_data\|input_data_end\)$$/\#define 
 \2 0x\1/p'
  +
  +quiet_cmd_offsets = OFFSETS $@
  +  cmd_offsets = $(NM) $ | sed -n $(sed-offsets)  $@
  +
  +$(obj)/offsets.h: $(obj)/compressed/vmlinux FORCE
  +   $(call if_changed,offsets)
  +
  +targets += offsets.h
  +
  +AFLAGS_header.o += -I$(obj)
  +$(obj)/header.o: $(obj)/offsets.h
  +
   LDFLAGS_setup.elf  := -T
   $(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
 $(call if_changed,ld)
  diff --git a/arch/x86/boot/compressed/Makefile 
 b/arch/x86/boot/compressed/Makefile
  index d2b9f3b..92fdd35 100644
  --- a/arch/x86/boot/compressed/Makefile
  +++ b/arch/x86/boot/compressed/Makefile
  @@ -22,7 +22,7 @@ $(obj)/vmlinux: $(src)/vmlinux_$(BITS).lds 
 $(obj)/head_$(BITS).o $(obj)/misc.o $
 $(call if_changed,ld)
 @:

  -OBJCOPYFLAGS_vmlinux.bin := -O binary -R .note -R .comment -S
  +OBJCOPYFLAGS_vmlinux.bin :=  -R .comment -S
   $(obj)/vmlinux.bin: vmlinux FORCE
 $(call if_changed,objcopy)

  diff --git a/arch/x86/boot/compressed/misc.c 
 b/arch/x86/boot/compressed/misc.c
  index 8182e32..69aec2f 100644
  --- a/arch/x86/boot/compressed/misc.c
  +++ b/arch/x86/boot/compressed/misc.c
  @@ -15,6 +15,10 @@
   * we just keep it from happening
   */
   #undef CONFIG_PARAVIRT
  +#ifdef CONFIG_X86_32
  +#define _ASM_DESC_H_ 1
  +#endif
  +
   #ifdef CONFIG_X86_64
   #define _LINUX_STRING_H_ 1
   #define __LINUX_BITMAP_H 1
  @@ -22,6 +26,7 @@

   #include linux/linkage.h
   #include linux/screen_info.h
  +#include linux/elf.h
   #include asm/io.h
   #include asm/page.h
   #include asm/boot.h
  @@ -365,6 +370,56 @@ static void error(char *x)
 asm(hlt);
   }

  +static void parse_elf(void *output)
  +{
  +#ifdef CONFIG_X86_64
  +   Elf64_Ehdr ehdr;
  +   Elf64_Phdr *phdrs, *phdr;
  +#else
  +   Elf32_Ehdr ehdr;
  +   Elf32_Phdr *phdrs, *phdr;
  +#endif
  +   void *dest;
  +   int i;
  +
  +   memcpy(ehdr, output, sizeof(ehdr));
  +   if(ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
  +  ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
  +  ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
  +  ehdr.e_ident[EI_MAG3] != ELFMAG3)
  +   {
  +   error(Kernel is not a valid ELF file);
  +   return;
  +   }
  +
  +   putstr(Parsing ELF... );
  +
  +   phdrs = malloc(sizeof(*phdrs) * ehdr.e_phnum);
  +   if (!phdrs)
  +   error(Failed to allocate space for phdrs);
  +
  +   memcpy(phdrs, output + ehdr.e_phoff, sizeof(*phdrs) * ehdr.e_phnum);
  +
 

Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-04-07 Thread Yinghai Lu
On Sun, Apr 6, 2008 at 10:25 AM, H. Peter Anvin [EMAIL PROTECTED] wrote:

 Yinghai Lu wrote:

  On Sun, Apr 6, 2008 at 9:38 AM, H. Peter Anvin [EMAIL PROTECTED] wrote:
 
   Yinghai Lu wrote:
  
  
so will cost every bzImage extra memory copy? that could be 18M or
even more big.
   
   
I wouldn't worry about that.  You will typically have several copies of
 the
   images during the execution of the boot loader.
  
 
  i put all drivers needed in kernel.
  1. bootloader copy bzImage (6M) to memory
  2. arch/x86/boot/compressed/head_32.S, will copy bzImage to end of
  buffer to do uncompress on possiton.
  3. parse_elf will copy the vmlinux (the uncompressed, that is some big,
 18M)
 
  I suggest that could have special elf header, and will only have one
  PT_LOAD, and avoid the copy, and just offset start address of
  uncompressed kernel for jump later.
 

  Once again, I think you will have a hard time measuring the time
 difference.

ok.

forget about my proposal..

YH
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-04-07 Thread Yinghai Lu
On Sun, Apr 6, 2008 at 9:38 AM, H. Peter Anvin [EMAIL PROTECTED] wrote:
 Yinghai Lu wrote:

  so will cost every bzImage extra memory copy? that could be 18M or
  even more big.
 

  I wouldn't worry about that.  You will typically have several copies of the
 images during the execution of the boot loader.

i put all drivers needed in kernel.
1. bootloader copy bzImage (6M) to memory
2. arch/x86/boot/compressed/head_32.S, will copy bzImage to end of
buffer to do uncompress on possiton.
3. parse_elf will copy the vmlinux (the uncompressed, that is some big, 18M)

I suggest that could have special elf header, and will only have one
PT_LOAD, and avoid the copy, and just offset start address of
uncompressed kernel for jump later.

YH
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: remove overzealous BUG_ON.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 23:50:24 Anthony Liguori wrote:
 Rusty Russell wrote:
  The 'disable_cb' callback is designed as an optimization to tell the host
  we don't need callbacks now.  As it is not reliable, the debug check is
  overzealous: it can happen on two CPUs at the same time.  Document this.
 
  Even if it were reliable, the virtio_net driver doesn't disable
  callbacks on transmit so the START_USE/END_USE debugging reentrance
  protection can be easily tripped even on UP.
 
  Thanks to Balaji Rao for the bug report and testing.

 I think the printk() in start_xmit should be removed too (or at least
 rate limited).

Yep, please send patch.  I'm about to jump on a plane, so consider this a 
pre-canned ack.

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] virtio_net: remove overzealous printk

2008-04-07 Thread Anthony Liguori
The 'disable_cb' is really just a hint and as such, it's possible for more
work to get queued up while callbacks are disabled.  Under stress with an
SMP guest, this printk triggers very frequently.  There is no race here, this
is how things are designed to work so let's just remove the printk.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]
Acked-by: Rusty Russell [EMAIL PROTECTED]

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b58472c..d1a200f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -284,7 +284,6 @@ again:
/* Activate callback for using skbs: if this returns false it
 * means some were used in the meantime. */
if (unlikely(!vi-svq-vq_ops-enable_cb(vi-svq))) {
-   printk(Unlikely: restart svq race\n);
vi-svq-vq_ops-disable_cb(vi-svq);
netif_start_queue(dev);
goto again;
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Call for Presentations: KVM Forum 2008

2008-04-07 Thread Avi Kivity
[Note: KVM Forum registration is now open at 
http://kforum.qumranet.com/KVMForum/about_kvmforum.php]

This is the Call for Presentations for the second annual KVM Developer's
Forum, to be held on June 10-13, 2008, in Napa, California, USA [1].  We
are looking for presentations on KVM development, quality assurance,
management, security, interoperability, architecture support, and 
interesting use cases.  Presentations are 50 minutes in length; there 
are also 25-minute mini-presentation slots available.

KVM Forum presentations are an excellent way to inform the KVM 
development community about your work, and to gather valuable feedback 
about your approach.

Please send your presentation proposal to the KVM Forum 2008 Content
Committee at [EMAIL PROTECTED] by April 20th.

KVM Forum 2008 Content Committee:
 Dor Laor
 Anthony Liguori
 Avi Kivity

[1] http://kforum.qumranet.com/KVMForum/about_kvmforum.php

-- 
Any sufficiently difficult bug is indistinguishable from a feature.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 1/5] vringfd syscall

2008-04-07 Thread Rusty Russell
On Tuesday 08 April 2008 03:54:34 Jonathan Corbet wrote:
 Hey, Rusty,

  For virtualization, we've developed virtio_ring for efficient
  communication. This would also work well for userspace-kernel
  communication, particularly for things like the tun device.  By using the
  same ABI, we can join guests to the host kernel trivially.

 I'm *sure* you meant to document that somewhat non-trivial proposed new
 kernel API as soon as you got a moment.

Actually, yes.  But I wanted to get it out there before I start the treck 
across to the virtualization summit.

A few points:
'The page alignment for the used array is important - that array might be 
mapped separately into kernel space.'
   Well, the used array is written by one side only, so it's possible to split 
the ring here and make each part r/o to the other side.  More importantly, a 
page boundary is almost certainly a cacheline boundary, and we already have a 
userspace interface for it.

'Note that the flags fields in the vring_avail and vring_used structures 
appear to be unused.'
   virtio uses these for wakeup/interrupt suppression.  It's a cheap way to 
avoid hypercalls, and we can use them the same way to avoid system calls (you 
set the suppression bit while you're actually looking at the ring).

The need for the kmap (and hence the atomic horror) has now been alleviated: I 
changed the shinfo destructor code to allow the destructor to hold onto the 
skb data so it can queue it and free it later.

BTW, the only place currently where both output and input buffers are used is 
the virtio_blk driver doing a read, where the header describes the operation, 
and the other buffers are overwritten with the data.

Thanks!
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Call for Presentations: KVM Forum 2008

2008-04-07 Thread Rusty Russell
On Tuesday 08 April 2008 10:17:26 Avi Kivity wrote:
 [Note: KVM Forum registration is now open at
 http://kforum.qumranet.com/KVMForum/about_kvmforum.php]

 This is the Call for Presentations for the second annual KVM Developer's
 Forum, to be held on June 10-13, 2008, in Napa, California, USA [1].

Nak, unf.  I've been told pretty clearly that I'm only doing one trip this 
year, and I'm on it.

I'm sure Anthony to channel my opinions with no effort :)

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 1/5] vringfd syscall

2008-04-07 Thread Arnd Bergmann
On Saturday 05 April 2008, Rusty Russell wrote:
 +asmlinkage long sys_vringfd(void __user *addr,
 +       unsigned num_descs,
 +       u16 __user *last_used)
 +{
 +   int fd, err;
 +   struct file *filp;
 +   struct vring_info *vr;
 +
 +   /* Must be a power of two, and representable by u16 */
 +   if (!num_descs || (num_descs  (num_descs-1)) || num_descs  65536) {
 +   err = -EINVAL;
 +   goto out;
 +   }
 +
 +   fd = get_unused_fd();
 +   if (fd  0) {
 +   err = fd;
 +   goto out;
 +   }
 +
 +   filp = alloc_file(vring_mnt, dget(vring_mnt-mnt_root), FMODE_WRITE,
 +     vring_fops);
 +   if (!filp) {
 +   err = -ENFILE;
 +   goto put_fd;
 +   }

This looks like a candidate for anon_inode_getfd(), which would let you
get rid of the code for registering your own file system.

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 2/5] vringfd base/offset

2008-04-07 Thread Arnd Bergmann
On Saturday 05 April 2008, Rusty Russell wrote:
  static const struct file_operations vring_fops = {
 .release= vring_release,
 .write  = vring_write,
 .poll   = vring_poll,
 +   .ioctl  = vring_ioctl,
  };

This should also set the .compat_ioctl pointer. If the argument 
is defined as an actual memory location instead of a pointer,
you also don't need the compat_ptr() conversion for s390x, so
both can point to the same vring_ioctl function.

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization