EHCI / USB2.0 for USB passthrough, or how to pass USB host device

2010-05-07 Thread Tom Lanyon
Hi list,

I've been playing with some KVM guests on KVM 83 on a RedHat 2.6.18 kernel 
(2.6.18-164.15.1.el5).

I tried to pass through a USB TV tuner device with a hostdev option in the 
guest's configuration. The guest can see the device, but the driver 
(dvb_usb_dib0700) refuses to initialise it because QEMU emulates a USB 1.1 
host controller and the device needs USB 2.0:

dvb-usb: This USB2.0 device cannot be run on a USB1.1 port. (it lacks a 
hardware PID filter)

Instead, and as this is the only USB device on the host, I tried to pass 
through the whole USB host controller to the guest via PCI pass through.

There are three functions provided by the USB controller's PCI device:
01:08.0 USB Controller: VIA Technologies, Inc. VT82x UHCI USB 1.1 Controller (rev 62)
01:08.1 USB Controller: VIA Technologies, Inc. VT82x UHCI USB 1.1 Controller (rev 62)
01:08.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 65)

So I tried to pass the USB 2.0 function (01:08.2) to the guest, but received an 
error when trying to start the guest:

error: this function is not supported by the hypervisor: No PCI reset capability available for :01:08.2

I figured this was because I was only trying to pass one function of a 
multi-function device, so I tried passing all three functions concurrently but 
received the same 'PCI reset capability' error.
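A quick way to see why the three entries count as one device: they share bus 01 and device (slot) 08 and differ only in the function number after the dot. The sketch below groups lspci lines by bus:device. It assumes lspci's default `bus:device.function` output without a PCI domain prefix, and the helper name `group_functions` is made up for illustration. It does not tell you whether the device supports a PCI reset, which is what the error above is actually complaining about.

```python
import re
from collections import defaultdict

# lspci's default address format is bus:device.function (two hex digits for
# bus and device, one digit 0-7 for the function); no PCI domain prefix.
LSPCI_RE = re.compile(r"^([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])\s+(.*)$")


def group_functions(lspci_lines):
    """Map 'bus:device' to a list of (function, description) tuples."""
    slots = defaultdict(list)
    for line in lspci_lines:
        m = LSPCI_RE.match(line.strip())
        if m is None:
            continue  # skip wrapped or unrelated lines
        bus, dev, fn, desc = m.groups()
        slots["%s:%s" % (bus, dev)].append((int(fn), desc))
    return dict(slots)


listing = [
    "01:08.0 USB Controller: VIA Technologies, Inc. VT82x UHCI USB 1.1 Controller (rev 62)",
    "01:08.1 USB Controller: VIA Technologies, Inc. VT82x UHCI USB 1.1 Controller (rev 62)",
    "01:08.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 65)",
]

for slot, funcs in group_functions(listing).items():
    if len(funcs) > 1:
        print("%s is multi-function: functions %s" % (slot, [f for f, _ in funcs]))
```

For the listing above it reports slot 01:08 with functions 0, 1 and 2.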

So, is there a way to emulate a USB 2.0 / EHCI controller in a guest and pass 
my USB device through? Or, alternatively, can anyone suggest how to get the PCI 
device(s) for the physical USB controller passed through?

Thanks,
Tom
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


patch kvm-remove-unused-load_segment_descriptor_to_kvm_desct.patch added to 2.6.32-stable tree

2010-05-07 Thread gregkh

This is a note to let you know that we have just queued up the patch titled

Subject: KVM: remove unused load_segment_descriptor_to_kvm_desct

to the 2.6.32-stable tree.  Its filename is

kvm-remove-unused-load_segment_descriptor_to_kvm_desct.patch

A git repo of this tree can be found at 

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary


>From mtosa...@redhat.com  Fri May  7 15:13:09 2010
From: Marcelo Tosatti 
Date: Tue, 27 Apr 2010 13:35:26 -0300
Subject: KVM: remove unused load_segment_descriptor_to_kvm_desct
To: Greg KH 
Cc: kvm , Jan Kiszka , 
sta...@kernel.org, Gleb Natapov , Avi Kivity 
Message-ID: <20100427163526.ga25...@amt.cnet>
Content-Disposition: inline

From: Marcelo Tosatti 

Commit 78ce64a384 in v2.6.32.12 open-coded the
load_segment_descriptor_to_kvm_desct helper, leaving it unused and
introducing a compiler warning.

On upstream, the helper was removed as part of a different commit.

Remove the now unused function.

Signed-off-by: Marcelo Tosatti 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/x86/kvm/x86.c |   12 
 1 file changed, 12 deletions(-)

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4155,18 +4155,6 @@ static u16 get_segment_selector(struct k
         return kvm_seg.selector;
 }
 
-static int load_segment_descriptor_to_kvm_desct(struct kvm_vcpu *vcpu,
-                                                u16 selector,
-                                                struct kvm_segment *kvm_seg)
-{
-        struct desc_struct seg_desc;
-
-        if (load_guest_segment_descriptor(vcpu, selector, &seg_desc))
-                return 1;
-        seg_desct_to_kvm_desct(&seg_desc, selector, kvm_seg);
-        return 0;
-}
-
 static int kvm_load_realmode_segment(struct kvm_vcpu *vcpu, u16 selector, int seg)
 {
         struct kvm_segment segvar = {


Patches currently in stable-queue which might be from mtosa...@redhat.com are

queue-2.6.32/kvm-remove-unused-load_segment_descriptor_to_kvm_desct.patch


Re: [PATCH] kvm mmu: reduce 50% memory usage

2010-05-07 Thread Marcelo Tosatti
On Fri, May 07, 2010 at 04:25:36PM +0800, Lai Jiangshan wrote:
> Subject: [PATCH] kvm, tdp: calculate correct base gfn for non-DIR level
> 
> the base gfn calculation is incorrect in __direct_map(),
> it does not calculate correctly when level=3 or 4.
> 
> Signed-off-by: Lai Jiangshan 
> Reported-by: Marcelo Tosatti 
> ---
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index a5c6719..6986a6f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1955,7 +1956,10 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
>                  }
>  
>                  if (*iterator.sptep == shadow_trap_nonpresent_pte) {
> -                        pseudo_gfn = (iterator.addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
> +                        u64 base_addr = iterator.addr;
> +
> +                        base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
> +                        pseudo_gfn = base_addr >> PAGE_SHIFT;
>                          sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
>                                                iterator.level - 1,
>                                                1, ACC_ALL, iterator.sptep);

Looks good, please resend the whole patch.
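To see what the fix changes, here is a small arithmetic model of the two masks. It assumes x86-64 4-level paging: 4 KiB pages (PAGE_SHIFT = 12), 9 index bits per level and a 52-bit physical address mask. The Python helpers only mimic PT64_DIR_BASE_ADDR_MASK and PT64_LVL_ADDR_MASK(level); they are not the kernel macros themselves. An entry at level L covers 2^(12 + 9*(L-1)) bytes, so the base of the region it maps has to clear that many low bits, while the old code always cleared only the level-2 (2 MiB) offset.

```python
PAGE_SHIFT = 12        # 4 KiB pages
PT64_LEVEL_BITS = 9    # 512 entries per page-table page
# Assumed 52-bit physical address width, page aligned.
PT64_BASE_ADDR_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)


def lvl_addr_mask(level):
    """Clear the offset bits covered by one entry at `level` (level 1 = 4 KiB)."""
    shift = PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS
    return PT64_BASE_ADDR_MASK & ~((1 << shift) - 1)


# The old code always used the level-2 ("DIR") mask.
PT64_DIR_BASE_ADDR_MASK = lvl_addr_mask(2)


def old_base_gfn(addr, level):
    # `level` is ignored here, which is exactly the bug.
    return (addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT


def new_base_gfn(addr, level):
    return (addr & lvl_addr_mask(level)) >> PAGE_SHIFT


addr = 0x40201000  # 1 GiB + 2 MiB + 4 KiB
for level in (2, 3, 4):
    print(level, hex(old_base_gfn(addr, level)), hex(new_base_gfn(addr, level)))
```

For level 2 both masks agree (0x40200), which is why only level 3 and 4 entries were affected: at level 3 the old code yields 0x40200 instead of 0x40000.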


Re: [PATCH 2/2] turn off kvmclock when resetting cpu

2010-05-07 Thread Marcelo Tosatti
On Tue, May 04, 2010 at 02:35:28PM -0400, Glauber Costa wrote:
> Currently, in the Linux kernel, we reset kvmclock if we are rebooting
> into a crash kernel through kexec. The rationale is that a new kernel
> won't use the same memory addresses, and the memory where kvmclock is
> located in the first kernel will be something else in the second one.
> 
> We don't do it in normal reboots, because the second kernel ends up
> registering kvmclock again, which has the effect of turning off the
> first instance.
> 
> This is, however, totally wrong. This assumes we're booting into
> a kernel that also has kvmclock enabled. If for some reason we reboot
> into something that doesn't do kvmclock, including but not limited to:
>  * rebooting into an older kernel without kvmclock support,
>  * rebooting with no-kvmclock,
>  * rebooting into another OS,
> 
> we'll simply have the hypervisor writing into a random memory position
> in the guest. Neat, huh?
> 
> Moreover, I believe the fix belongs in qemu, since it is the entity
> best prepared to detect all kinds of reboots (by means of a cpu_reset),
> not to mention misbehaving guests that can forget
> to turn kvmclock off.
> 
> This patch fixes the issue for me.
> 
> Signed-off-by: Glauber Costa 
> ---
>  qemu-kvm-x86.c |   19 +++
>  1 files changed, 19 insertions(+), 0 deletions(-)
> 
> diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
> index 439c31a..4b94e04 100644
> --- a/qemu-kvm-x86.c
> +++ b/qemu-kvm-x86.c
> @@ -1417,8 +1417,27 @@ void kvm_arch_push_nmi(void *opaque)
>  }
>  #endif /* KVM_CAP_USER_NMI */
>  
> +static int kvm_turn_off_clock(CPUState *env)
> +{
> +    struct {
> +        struct kvm_msrs info;
> +        struct kvm_msr_entry entries[100];
> +    } msr_data;
> +
> +    struct kvm_msr_entry *msrs = msr_data.entries;
> +    int n = 0;
> +
> +    kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME, 0);
> +    kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, 0);
> +    msr_data.info.nmsrs = n;
> +
> +    return kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data);
> +}
> +
> +
>  void kvm_arch_cpu_reset(CPUState *env)
>  {
> +    kvm_turn_off_clock(env);

Better to zero env->system_time_msr and wall_clock_msr here, and modify
kvm_arch_load_regs to write those MSRs on KVM_PUT_RESET_STATE (to be
conformant with the new writeback scheme).
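For reference, this is roughly what the kvm_turn_off_clock() payload looks like in memory. The layout of struct kvm_msrs (a u32 count plus u32 padding, followed by 16-byte kvm_msr_entry records of u32 index, u32 reserved, u64 data) and the MSR numbers 0x11/0x12 are taken from the x86 KVM headers of this period as I understand them; treat them as assumptions. Although the patch reserves room for 100 entries, KVM only consumes the first nmsrs of them, so the payload that matters is just the header plus two entries.

```python
import struct

# Legacy kvmclock MSR indices (assumed values).
MSR_KVM_WALL_CLOCK = 0x11
MSR_KVM_SYSTEM_TIME = 0x12

# '<' = little-endian, no padding, as on an x86 host.
# struct kvm_msrs header: __u32 nmsrs; __u32 pad;
KVM_MSRS_HDR = struct.Struct("<II")
# struct kvm_msr_entry: __u32 index; __u32 reserved; __u64 data;
KVM_MSR_ENTRY = struct.Struct("<IIQ")


def pack_set_msrs(entries):
    """Serialize (index, data) pairs in the KVM_SET_MSRS buffer layout."""
    buf = KVM_MSRS_HDR.pack(len(entries), 0)
    for index, data in entries:
        buf += KVM_MSR_ENTRY.pack(index, 0, data)
    return buf


def turn_off_clock_payload():
    # Mirrors kvm_turn_off_clock(): two entries, both written as 0.  For
    # MSR_KVM_SYSTEM_TIME, bit 0 is the enable bit, so 0 stops the
    # per-vcpu time updates into guest memory.
    return pack_set_msrs([(MSR_KVM_SYSTEM_TIME, 0), (MSR_KVM_WALL_CLOCK, 0)])


payload = turn_off_clock_payload()
print(len(payload))  # 8-byte header + 2 * 16-byte entries = 40
```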



[ kvm-Bugs-2998292 ] XP Guest Crash on Ubuntu 10.04

2010-05-07 Thread SourceForge.net
Bugs item #2998292, was opened at 2010-05-07 18:28
Message generated for change (Comment added) made by 
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2998292&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
>Priority: 7
Private: No
Submitted By: cheesewright ()
Assigned to: Nobody/Anonymous (nobody)
Summary: XP Guest Crash on Ubuntu 10.04

Initial Comment:
OS: Ubuntu Server 64-bit 10.04, kernel 2.6.32
Server: Supermicro X8DTN+ 2x Xeon E5530 2.4GHz Nehalem quad core cpu, 12GB RAM
Guest: Windows XP SP3 32-bit

Below is my kvm command line:

/usr/bin/kvm -S -M pc-0.11 -enable-kvm -m 1024 -smp 2 -name data2 \
  -uuid 9a85ee7a-4db9-9108-df3c-61d04d3c943d \
  -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/data2.monitor,server,nowait \
  -monitor chardev:monitor -localtime -boot c \
  -drive file=/san/data2/winxp.img,if=ide,index=0,boot=on \
  -drive file=/dev/cdrom,if=ide,media=cdrom,index=2 \
  -net nic,macaddr=52:54:00:38:a8:af,vlan=0,name=nic.0 \
  -net tap,fd=62,vlan=0,name=tap.0 \
  -chardev pty,id=serial0 -serial chardev:serial0 \
  -parallel none -usb -usbdevice tablet \
  -vnc 0.0.0.0:2,password -k en-us -vga cirrus

I have two XP SP3 guests which crash frequently on random events when 
connecting via remote desktop. The Windows Event Log reports Event ID 1003. 
These errors result in either a system freeze or full reboot.

There are several types of errors recorded:

Event Source:   System Error
Event Category: (102)
Event ID:   1003
Error code 0005, parameter1 04c0, parameter2 8653a6c8, parameter3 
, parameter4 .

Event Source:   System Error
Event Category: (102)
Event ID:   1003
Error code 000a, parameter1 3008, parameter2 0002, parameter3 
, parameter4 804ff806.

Event Source:   System Error
Event Category: (102)
Event ID:   1003
Error code 100a, parameter1 3008, parameter2 0002, parameter3 
, parameter4 804ff806.


--

>Comment By: cheesewright ()
Date: 2010-05-07 18:32

Message:
The crash also happens when using kvm -M pc-0.12. It is easily reproducible
after clicking through windows, such as Control Panel applets or web browsers,
for a few minutes.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2998292&group_id=180599


[ kvm-Bugs-2998292 ] XP Guest Crash on Ubuntu 10.04

2010-05-07 Thread SourceForge.net
Bugs item #2998292, was opened at 2010-05-07 18:28
Message generated for change (Tracker Item Submitted) made by 
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2998292&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: cheesewright ()
Assigned to: Nobody/Anonymous (nobody)
Summary: XP Guest Crash on Ubuntu 10.04

Initial Comment:
OS: Ubuntu Server 64-bit 10.04, kernel 2.6.32
Server: Supermicro X8DTN+ 2x Xeon E5530 2.4GHz Nehalem quad core cpu, 12GB RAM
Guest: Windows XP SP3 32-bit

Below is my kvm command line:

/usr/bin/kvm -S -M pc-0.11 -enable-kvm -m 1024 -smp 2 -name data2 \
  -uuid 9a85ee7a-4db9-9108-df3c-61d04d3c943d \
  -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/data2.monitor,server,nowait \
  -monitor chardev:monitor -localtime -boot c \
  -drive file=/san/data2/winxp.img,if=ide,index=0,boot=on \
  -drive file=/dev/cdrom,if=ide,media=cdrom,index=2 \
  -net nic,macaddr=52:54:00:38:a8:af,vlan=0,name=nic.0 \
  -net tap,fd=62,vlan=0,name=tap.0 \
  -chardev pty,id=serial0 -serial chardev:serial0 \
  -parallel none -usb -usbdevice tablet \
  -vnc 0.0.0.0:2,password -k en-us -vga cirrus

I have two XP SP3 guests which crash frequently on random events when 
connecting via remote desktop. The Windows Event Log reports Event ID 1003. 
These errors result in either a system freeze or full reboot.

There are several types of errors recorded:

Event Source:   System Error
Event Category: (102)
Event ID:   1003
Error code 0005, parameter1 04c0, parameter2 8653a6c8, parameter3 
, parameter4 .

Event Source:   System Error
Event Category: (102)
Event ID:   1003
Error code 000a, parameter1 3008, parameter2 0002, parameter3 
, parameter4 804ff806.

Event Source:   System Error
Event Category: (102)
Event ID:   1003
Error code 100a, parameter1 3008, parameter2 0002, parameter3 
, parameter4 804ff806.


--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2998292&group_id=180599


[Reminder] KVM Forum 2010: Call for Papers

2010-05-07 Thread KVM Forum 2010 Program Committee
Just a reminder: the submission deadline is in one week.

thanks,
-KVM Forum 2010 Program Committee
--

=

CALL FOR PAPERS 

KVM Forum 2010

=


DESCRIPTION

 The KVM Forum is back! After a break last year we're proud to present
 this year's gathering around KVM again.  The idea is to have everyone
 involved with KVM development come together to talk about the future
 and current state of KVM, filling in any pieces of the puzzle people
 might otherwise be missing.
 
 So if you're a KVM developer, mark the dates in your calendar! If
 possible, also submit a talk -- we're interested in a wide variety of
 KVM topics, so don't hesitate to propose a talk on your work.

 If you're not a KVM developer, please read on nevertheless (or jump to
 END-USER COLLABORATION).
 

DATES / LOCATION

 Conference: August 9 - 10, 2010
 Location: Renaissance Boston Waterfront in Boston, MA
 
 Abstracts due: May 14th, 2010
 Notification: May 28th, 2010
 
 Yes, we're colocated with LinuxCon.  Tickets for the KVM Forum also count
 for LinuxCon.

 
http://events.linuxfoundation.org/component/registrationpro/?func=details&did=34
 
 
PROCESS

 First, check that it's before May 14th.  If you're past that date, you're
 out of luck.  Now think hard and come up with a great idea that
 you could talk about.  Once you have that set, we need you to write up
 a short abstract (~150 words) on it.  In your submission, please note
 how long your talk will take.  Slots vary in length up to one hour.
 Also include in your proposal the proposal type -- one of: technical talk,
 breakout session, or end-user talk.  Add that information to the abstract
 and submit it at the following URL:
 
 http://events.linuxfoundation.org/cfp/cfp-add
 
 Now, wait until May 24th.  You will receive a notification on whether
 your talk was accepted or not.
 
 
SCOPE OF TALKS

 We have a list of suggested presentation topics below.  These suggestions
 are just for guidance; please feel free to submit a proposal on any
 of these or related topics.  In general, the more it's about backend
 infrastructure, the better.
 
 KVM
   - Scaling and performance
   - Nested virtualization
   - I/O improvements
   - Driver domains
   - Time keeping
   - Memory management (page sharing, swapping, huge pages, etc)
   - Fault tolerance
   - VEPA, vswitch

 Embedded KVM
   - KVM on ARM, PPC, MIPS, ...?
   - Real-time requirements host/guest
   - Device pass-through w/o iommu
   - Custom device/platform models
 
 QEMU
   - Device model improvements
   - New devices
   - Security model
   - Scaling and performance
   - Desktop virtualization
   - Increasing robustness
   - Management interfaces
   - QMP protocol and implementation
   - Live migration
 
 Virtio
   - Speeding up existing devices
   - Vhost
   - Alternatives
   - Using virtio in non-kvm environments
   - Virtio on non-Linux
 
 Management infrastructure
   - Libvirt
   - Kvm autotest
   - Easy networking
   - Qemud


BREAKOUT SESSION

 We will reserve some time each day to break out for working sessions.
 These sessions will be less formal than a presentation and more focused
 on developing a solution to some real development issue.  If you are
 interested in getting developers together to hack on some code, submit
 your proposal and just make it clear it's a breakout session proposal.
 

END-USER COLLABORATION

 One of the big challenges as developers is to know what, where and how
 people actually use our software.  To solve this issue at least a little,
 there will be a few slots reserved for end users talking about their
 deployments, problems and achievements.
 
 So if you have a KVM-based deployment running in production or are about
 to roll one out, please also submit a talk (see PROCESS), and simply
 mark it as an end-user collaboration proposal.  We would love to have
 an open discussion of fields where KVM/Qemu can still improve and you
 would have the unique chance to steer that process!
  
 Keep in mind that most of the Forum will be focused on development though,
 so we suggest you also come with a good portion of technical interest :-).
 And of course, no product marketing please!  The purpose is to engage
 with KVM developers.
 
 
LIGHTNING TALKS

 In addition to submitted talks, we will also have some room for lightning
 talks.  So if you have something you think might be done by the KVM
 Forum but you're not sure you could fill 15 minutes with it, or you
 don't know whether you'll get it done in time, keep in mind that you
 will still get the chance to talk about it.  Lightning talk submissions
 and scheduling will be handled on-site at KVM Forum.


Thank you for your interest in KVM.  We're looking forward to your
submissions and seeing you at the KVM Forum 2010 in August!  Now, start
thinking about that talk you want to give.

Thanks,
your KVM Forum 2010 Program Committee

Re: Endless loop in qcow2_alloc_cluster_offset

2010-05-07 Thread Marcelo Tosatti
On Fri, May 07, 2010 at 09:37:22AM +0200, Kevin Wolf wrote:
> Am 07.05.2010 03:19, schrieb Marcelo Tosatti:
> > On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
> >> Hi,
> >>
> >> I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
> >> endless loop in qcow2_alloc_cluster_offset, namely over
> >> QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
> >>
> >> (gdb) bt
> >> #0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, 
> >> offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at 
> >> /data/qemu-kvm/block/qcow2-cluster.c:750
> >> #1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at 
> >> /data/qemu-kvm/block/qcow2.c:587
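> >> #2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>, 
> >> sector_num=<value optimized out>, qiov=<value optimized out>, 
> >> nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645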
> >> #3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, 
> >> qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 , 
> >> opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
> >> #4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, 
> >> buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
> >> #5  0x00435581 in ide_sector_write (s=0xc92650) at 
> >> /data/qemu-kvm/hw/ide/core.c:622
> >> #6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at 
> >> /data/qemu-kvm/kvm-all.c:553
> >> #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
> >> #8  0x00426049 in kvm_cpu_exec (env=0x1000) at 
> >> /data/qemu-kvm/qemu-kvm.c:1651
> >> #9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) 
> >> at /data/qemu-kvm/qemu-kvm.c:1893
> >> #10 ap_main_loop (_env=<value optimized out>) at 
> >> /data/qemu-kvm/qemu-kvm.c:1943
> >> #11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
> >> #12 0x7f48abf0711d in clone () from /lib64/libc.so.6
> >> #13 0x in ?? ()
> >> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> >> $5 = (struct QCowL2Meta *) 0xcb3568
> >> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> >> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 
> >> 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, 
> >> next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
> >>
> >> So next == first.
> >>
> > 
> > I've seen the exact same bug twice in a row while installing FC12 on an IDE
> > disk with current qemu-kvm.git. 
> > 
> > qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
> > -m 1000  -vnc :1 \
> > -net nic,model=virtio \
> > -net tap,script=/root/ifup.sh -serial stdio \
> > -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso \
> > -monitor telnet::4445,server,nowait -usbdevice tablet
> > 
> > Can't reproduce though.
> 
> In current git master? That's interesting news. I had kind of expected
> it would be fixed with c644db3d.

Yes, with 31b460256 more precisely. And the symptom was the same as Jan
reported, cluster_allocs.lh_first had le_next pointing to itself.
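A self-referencing le_next is the simplest form of a cycle, and any QLIST_FOREACH over cluster_allocs will spin forever once one exists. The sketch below models the list in Python and checks for cycles with Floyd's tortoise-and-hare before walking it, raising an error that names the offending offset instead of looping. `L2Meta` is a hypothetical stand-in for QCowL2Meta with only the fields needed here; in qcow2 itself the equivalent would be an assert() on the C list that abort()s with similar information.

```python
class L2Meta(object):
    """Hypothetical stand-in for QCowL2Meta: just enough to walk the list."""

    def __init__(self, offset):
        self.offset = offset
        self.next_in_flight = None  # models next_in_flight.le_next


def find_cycle(head):
    """Floyd's tortoise-and-hare: return a node inside a cycle, or None."""
    slow = fast = head
    while fast is not None and fast.next_in_flight is not None:
        slow = slow.next_in_flight
        fast = fast.next_in_flight.next_in_flight
        if slow is fast:
            return slow
    return None


def in_flight_offsets(head):
    """Walk cluster_allocs, failing loudly instead of looping forever."""
    bad = find_cycle(head)
    if bad is not None:
        raise AssertionError(
            "cluster_allocs is corrupt: entry at offset %d is on a cycle "
            "(next_in_flight -> offset %d)" % (bad.offset, bad.next_in_flight.offset))
    offsets = []
    node = head
    while node is not None:
        offsets.append(node.offset)
        node = node.next_in_flight
    return offsets


# The state from Jan's gdb session: the first entry's le_next points to itself.
first = L2Meta(7417176064)
first.next_in_flight = first
try:
    in_flight_offsets(first)
except AssertionError as e:
    print(e)
```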

Perhaps you can add an assert there, so that it calls abort() in that case and
prints some useful information? I'll try to reproduce.



Re: [Autotest][PATCH] KVM Test: Make remote_scp() more robust.

2010-05-07 Thread Lucas Meneghel Rodrigues
On Fri, 2010-05-07 at 18:26 +0800, Feng Yang wrote:
> 1. In remote_scp(), if the SCP connection stalls for some reason, the
> following code runs:
>         else:  # match == None
>             logging.debug("Timeout elapsed or process terminated")
>             status = sub.get_status()
>             sub.close()
>             return status == 0
> At this point the kvm_subprocess server is still running, which means
> lock_server_running_filename is still locked, but sub.get_status()
> tries to lock it again. If the kvm_subprocess server keeps running,
> a deadlock occurs. This patch fixes the issue by enforcing the timeout
> parameter. It also raises the default timeout to 600, which should
> be enough.
> 
> 2. Add "-v" to the scp command to capture more information. Also add
> "Exit status" and "stalled" match patterns to remote_scp().

Excellent. I've made some slight changes to the message wording and
committed it:

http://autotest.kernel.org/changeset/4483

Thank you very much!
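The control flow of the patched loop can be modelled in isolation. The sketch below is hypothetical: `scp_wait` takes plain callables in place of the kvm_spawn object (standing in for `read_until_last_line_matches`, `is_alive` and `get_status`), omits the answers to the "Are you sure" and password prompts, and accepts an injectable clock so the deadline can be tested. It shows the two ideas in the patch: the overall deadline is checked on every iteration, so a "stalled" transfer cannot keep the loop alive forever, and a missing match only counts as the end once the process has actually exited. Like the patch, it relies on `scp -v` printing an "Exit status" line.

```python
import time

# Patterns in the order the patched remote_scp() passes them.
SCP_PATTERNS = [r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost connection",
                r"Exit status", r"stalled"]


def scp_wait(read_until_match, is_alive, get_status, timeout=600, now=time.time):
    """
    Hypothetical model of the patched remote_scp() loop.

    read_until_match(patterns) -> (index or None, text) stands in for
    sub.read_until_last_line_matches(); is_alive() and get_status() stand in
    for the kvm_spawn methods.  Prompt answering is omitted.
    Returns True on success and False on failure or timeout.
    """
    end_time = now() + timeout
    while True:
        # Checked on every pass, so "stalled" lines or silence cannot keep
        # the loop running past the overall deadline.
        if end_time <= now():
            return False
        match, text = read_until_match(SCP_PATTERNS)
        if match in (0, 1):
            continue  # the real code sends "yes" or the password here
        elif match == 2:  # "lost connection"
            return False
        elif match == 3:  # "Exit status" line printed by scp -v
            return "Exit status 0" in text
        elif match == 4:  # "stalled": keep waiting until end_time
            continue
        elif is_alive():  # nothing matched yet, but scp is still running
            continue
        else:  # scp exited without a recognised line
            return get_status() == 0
```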

> Signed-off-by: Feng Yang 
> ---
>  client/tests/kvm/kvm_utils.py |   36 
>  client/tests/kvm/kvm_vm.py|4 ++--
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
> index 25f3c8c..3db4dec 100644
> --- a/client/tests/kvm/kvm_utils.py
> +++ b/client/tests/kvm/kvm_utils.py
> @@ -524,7 +524,7 @@ def remote_login(command, password, prompt, linesep="\n", timeout=10):
>              return None
>  
> 
> -def remote_scp(command, password, timeout=300, login_timeout=10):
> +def remote_scp(command, password, timeout=600, login_timeout=10):
>      """
>      Run the given command using kvm_spawn and provide answers to the questions
>      asked. If timeout expires while waiting for the transfer to complete ,
> @@ -548,12 +548,18 @@ def remote_scp(command, password, timeout=300, login_timeout=10):
>  
>      password_prompt_count = 0
>      _timeout = login_timeout
> +    end_time = time.time() + timeout
> +    logging.debug("Trying to SCP...")
>  
> -    logging.debug("Trying to login...")
>  
>      while True:
> +        if end_time <= time.time():
> +            logging.debug("transfer timeout!")
> +            sub.close()
> +            return False
>          (match, text) = sub.read_until_last_line_matches(
> -                [r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost connection"],
> +                [r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost connection",
> +                 r"Exit status", r"stalled"],
>                  timeout=_timeout, internal_timeout=0.5)
>          if match == 0:  # "Are you sure you want to continue connecting"
>              logging.debug("Got 'Are you sure...'; sending 'yes'")
> @@ -574,15 +580,29 @@ def remote_scp(command, password, timeout=300, login_timeout=10):
>              logging.debug("Got 'lost connection'")
>              sub.close()
>              return False
> +        elif match == 3: # "Exit status"
> +            sub.close()
> +            if "Exit status 0" in text:
> +                logging.debug("SCP command completed successfully")
> +                return True
> +            else:
> +                logging.debug("SCP command fail with exit status %s" % text)
> +                return False
> +        elif match == 4: # "stalled"
> +            logging.debug("SCP connection stalled for some reason")
> +            continue
> +
>          else:  # match == None
> -            logging.debug("Timeout elapsed or process terminated")
> +            if sub.is_alive():
> +                continue
> +            logging.debug("Process terminated for some reason")
>              status = sub.get_status()
>              sub.close()
>              return status == 0
>  
> 
>  def scp_to_remote(host, port, username, password, local_path, remote_path,
> -                  timeout=300):
> +                  timeout=600):
>      """
>      Copy files to a remote host (guest).
>  
> @@ -596,14 +616,14 @@ def scp_to_remote(host, port, username, password, local_path, remote_path,
>  
>      @return: True on success and False on failure.
>      """
> -    command = ("scp -o UserKnownHostsFile=/dev/null "
> +    command = ("scp -v -o UserKnownHostsFile=/dev/null "
>                 "-o PreferredAuthentications=password -r -P %s %s %...@%s:%s" %
>                 (port, local_path, username, host, remote_path))
>      return remote_scp(command, password, timeout)
>  
> 
>  def scp_from_remote(host, port, username, password, remote_path, local_path,
> -                    timeout=300):
> +                    timeout=600):
>      """
>      Copy files from a remote host (guest).
>  
> @@ -617,7 +637,7 @@ def scp_from_remote(host, port, username, password, remote_path, local_path,
>  
>      @return: True on success and False on failure.
>      """
> -    command = ("scp -o UserKnownHostsFile=/dev/null "
> +    command = ("scp -v -o UserKnownHostsFile=/dev/null "
>

Re: [Autotest] [PATCH 1/3] KVM test: Use customized command to get the version of kvm and its

2010-05-07 Thread Jason Wang

- "Lucas Meneghel Rodrigues"  wrote:

> On Fri, 2010-05-07 at 18:10 +0800, Jason Wang wrote:
> > Lucas Meneghel Rodrigues wrote:
> > > On Mon, Apr 26, 2010 at 7:07 AM, Jason Wang  wrote:
> > >   
> > >> userspace
> > >>
> > >> The current method may or may not work on various kinds of
> > >> distributions. So this patch enables the ability to use customized
> > >> commands to get the version of kvm and its userspace. "kvm_ver_cmd" is
> > >> used for the kvm version and "kvm_userspace_ver_cmd" is for its userspace.
> > >> 
> > >
> > > The method we are currently using is pretty satisfactory - if we fail
> > > in getting /sys/module/kvm/version we use the kernel version as a
> > > fallback, which is good for the kernel module. For qemu, we make a
> > > regular expression searching for numbers following the string version,
> > > so I don't see why we should make it configurable. Care to
> > > provide an example of a situation where the current method fails?
> > >
> > >   
> > The current method may not be as accurate as we expected.
> > On my Fedora box, the output of qemu-kvm -h | head -n 1 is something like:
> > QEMU PC emulator version 0.9.1 (kvm-83-maint-snapshot-20090205), 
> > Copyright (c) 2003-2008 Fabrice Bellard
> > but rpm -qa may report a more accurate version:
> > qemu-kvm-0.11.0-13.fc12.x86_64
> 
> The above version of qemu looks like the one shipped in RHEL 5.X, not
> Fedora; you might have mistaken the versions. Here is what it looks
> like on:
> 
> Fedora 11:
> 
> [r...@localhost ~]# qemu-kvm -help | head -1
> QEMU PC emulator version 0.11.0 (qemu-kvm-0.11.0), Copyright (c) 2003-2008 Fabrice Bellard
> [r...@localhost ~]# rpm -qa | grep qemu-kvm
> qemu-kvm-0.11.0-13.fc12.x86_64
> 
> Fedora 13:
> 
> [...@freedom ~]$ qemu-kvm -h | head -1
> QEMU PC emulator version 0.12.3 (qemu-kvm-0.12.3), Copyright (c) 2003-2008 Fabrice Bellard
> [...@freedom ~]$ rpm -qa | grep qemu-kvm
> qemu-kvm-0.12.3-8.fc13.x86_64
> 
> Moreover, we deal with several build methods; qemu-kvm might not have
> been installed through rpm, so we have to use a single method to figure
> out the versions. Another point is that if we use such alternate methods
> (such as a git build or a brew build), we will have reliable versioning
> that can be extracted from the build logs.
> 
> My decision is that we keep the current method of determining the version.
> The current method is fairly reliable (though obviously not
> perfect) in my opinion, and it can be applied to pretty much all
> branches.
> 
> I would really like us to start embedding version control (git)
> information somewhere in the installed binaries, to make things easier for
> people bisecting issues, but I'm not sure what the maintainers would
> think about this.
> 
Thanks for the detailed explanation. I had missed the possibility 
of building from source.
I agree that we should keep the current method.
> 


Re: [Autotest] [PATCH 1/3] KVM test: Use customized command to get the version of kvm and its

2010-05-07 Thread Lucas Meneghel Rodrigues
On Fri, 2010-05-07 at 18:10 +0800, Jason Wang wrote:
> Lucas Meneghel Rodrigues wrote:
> > On Mon, Apr 26, 2010 at 7:07 AM, Jason Wang  wrote:
> >   
> >> userspace
> >>
> >> The current method may or may not work on various kinds of
> >> distributions. So this patch enables the ability to use customized
> >> commands to get the version of kvm and its userspace. "kvm_ver_cmd" is
> >> used for the kvm version and "kvm_userspace_ver_cmd" is for its userspace.
> >> 
> >
> > The method we are currently using is pretty satisfactory - if we fail
> > in getting /sys/module/kvm/version we use the kernel version as a
> > fallback, which is good for the kernel module. For qemu, we make a
> > regular expression searching for numbers following the string version,
> > so I don't see why we should make it configurable. Care to
> > provide an example of a situation where the current method fails?
> >
> >   
> The current method may not be as accurate as we expected.
> On my Fedora box, the output of qemu-kvm -h | head -n 1 is something like:
> QEMU PC emulator version 0.9.1 (kvm-83-maint-snapshot-20090205), 
> Copyright (c) 2003-2008 Fabrice Bellard
> but rpm -qa may report a more accurate version:
> qemu-kvm-0.11.0-13.fc12.x86_64

The above version of qemu looks like the one shipped in RHEL 5.X, not
Fedora; you might have mistaken the versions. Here is what it looks like on:

Fedora 11:

[r...@localhost ~]# qemu-kvm -help | head -1
QEMU PC emulator version 0.11.0 (qemu-kvm-0.11.0), Copyright (c) 2003-2008 Fabrice Bellard
[r...@localhost ~]# rpm -qa | grep qemu-kvm
qemu-kvm-0.11.0-13.fc12.x86_64

Fedora 13:

[...@freedom ~]$ qemu-kvm -h | head -1
QEMU PC emulator version 0.12.3 (qemu-kvm-0.12.3), Copyright (c) 2003-2008 Fabrice Bellard
[...@freedom ~]$ rpm -qa | grep qemu-kvm
qemu-kvm-0.12.3-8.fc13.x86_64
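The method described above can be sketched as follows; the function names are illustrative, not autotest's actual helpers. The kvm module version comes from /sys/module/kvm/version, with the running kernel release as the fallback, and the qemu version is the dotted number after the word "version" in the first line of `qemu-kvm -help`. The rpm helper is only there to compare against the outputs above and only applies to rpm installs.

```python
import platform
import re


def kvm_module_version(sysfs_path="/sys/module/kvm/version"):
    """Version of the kvm module, falling back to the running kernel release."""
    try:
        with open(sysfs_path) as f:
            return f.read().strip()
    except OSError:
        return platform.release()


_VERSION_RE = re.compile(r"version\s+([0-9]+(?:\.[0-9]+)*)")


def qemu_version_from_help(first_line):
    """Dotted number following 'version' in `qemu-kvm -help | head -1`."""
    m = _VERSION_RE.search(first_line)
    return m.group(1) if m else None


def rpm_version(nvr, name="qemu-kvm"):
    """Version field of an rpm name such as qemu-kvm-0.12.3-8.fc13.x86_64."""
    m = re.match(re.escape(name) + r"-([0-9][^-]*)-", nvr)
    return m.group(1) if m else None
```

On the Fedora outputs above the regex and the rpm version agree (0.11.0 and 0.12.3). On the maint-snapshot build quoted earlier in the thread it returns 0.9.1, which is the ambiguity that started the discussion.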

Moreover, we deal with several build methods; qemu-kvm might not have been
installed through rpm, so we have to use a single method to figure out
the versions. Another point is that if we use such alternate methods
(such as a git build or a brew build), we will have reliable versioning that
can be extracted from the build logs.

My decision is that we keep the current method of determining the version.
The current method is fairly reliable (though obviously not
perfect) in my opinion, and it can be applied to pretty much all
branches.

I would really like us to start embedding version control (git)
information somewhere in the installed binaries, to make things easier for
people bisecting issues, but I'm not sure what the maintainers would think
about this.




[Autotest][PATCH] KVM Test: Make remote_scp() more robust.

2010-05-07 Thread Feng Yang
1. In remote_scp(), if the SCP connection stalls for some reason, the
following code is run:

    else:  # match == None
        logging.debug("Timeout elapsed or process terminated")
        status = sub.get_status()
        sub.close()
        return status == 0

At this moment the kvm_subprocess server is still running, which means
lock_server_running_filename is still locked, but sub.get_status()
tries to lock it again. If the kvm_subprocess server keeps running,
a deadlock happens. This patch fixes the issue by enforcing the
timeout parameter. It also raises the default timeout to 600 seconds,
which should be enough.

2. Add "-v" to the scp command to capture more information. Also add
"Exit status" and "stalled" match prompts in remote_scp().
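The control flow that item 1 describes, an overall transfer deadline checked at the top of every loop iteration on top of the short per-read timeout, can be sketched in C as follows. This is only an illustration under stated assumptions: `enum scp_event`, `read_step_fn` and `wait_for_transfer` are hypothetical names, each enum value stands for one branch of the patched `remote_scp()`, and the kvm_subprocess locking that caused the deadlock is not modelled.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical outcome of one bounded read, one per branch of the
 * patched remote_scp(). */
enum scp_event {
    EV_NONE,       /* no prompt matched, process still alive */
    EV_STALLED,    /* scp -v reported "stalled" */
    EV_EXIT_OK,    /* "Exit status 0" */
    EV_EXIT_FAIL,  /* any other exit status */
    EV_LOST        /* "lost connection" */
};

typedef enum scp_event (*read_step_fn)(void *ctx);

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns true only on a successful exit; false on failure, lost
 * connection, or when the overall deadline expires. */
static bool wait_for_transfer(read_step_fn step, void *ctx, double timeout)
{
    const double end_time = now_seconds() + timeout;

    for (;;) {
        if (end_time <= now_seconds())
            return false;              /* overall transfer timeout */
        switch (step(ctx)) {
        case EV_EXIT_OK:
            return true;
        case EV_EXIT_FAIL:
        case EV_LOST:
            return false;
        case EV_NONE:
        case EV_STALLED:
            break;                     /* keep waiting until the deadline */
        }
    }
}
```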

Signed-off-by: Feng Yang 
---
 client/tests/kvm/kvm_utils.py |   36 
 client/tests/kvm/kvm_vm.py|4 ++--
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
index 25f3c8c..3db4dec 100644
--- a/client/tests/kvm/kvm_utils.py
+++ b/client/tests/kvm/kvm_utils.py
@@ -524,7 +524,7 @@ def remote_login(command, password, prompt, linesep="\n", timeout=10):
             return None
 
 
-def remote_scp(command, password, timeout=300, login_timeout=10):
+def remote_scp(command, password, timeout=600, login_timeout=10):
     """
     Run the given command using kvm_spawn and provide answers to the questions
     asked. If timeout expires while waiting for the transfer to complete ,
@@ -548,12 +548,18 @@ def remote_scp(command, password, timeout=300, login_timeout=10):
 
     password_prompt_count = 0
     _timeout = login_timeout
+    end_time = time.time() + timeout
+    logging.debug("Trying to SCP...")
 
-    logging.debug("Trying to login...")
 
     while True:
+        if end_time <= time.time():
+            logging.debug("transfer timeout!")
+            sub.close()
+            return False
         (match, text) = sub.read_until_last_line_matches(
-                [r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost connection"],
+                [r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost connection",
+                 r"Exit status", r"stalled"],
                 timeout=_timeout, internal_timeout=0.5)
         if match == 0:  # "Are you sure you want to continue connecting"
             logging.debug("Got 'Are you sure...'; sending 'yes'")
@@ -574,15 +580,29 @@ def remote_scp(command, password, timeout=300, login_timeout=10):
             logging.debug("Got 'lost connection'")
             sub.close()
             return False
+        elif match == 3: # "Exit status"
+            sub.close()
+            if "Exit status 0" in text:
+                logging.debug("SCP command completed successfully")
+                return True
+            else:
+                logging.debug("SCP command fail with exit status %s" % text)
+                return False
+        elif match == 4: # "stalled"
+            logging.debug("SCP connection stalled for some reason")
+            continue
+
         else:  # match == None
-            logging.debug("Timeout elapsed or process terminated")
+            if sub.is_alive():
+                continue
+            logging.debug("Process terminated for some reason")
             status = sub.get_status()
             sub.close()
             return status == 0
 
 
 def scp_to_remote(host, port, username, password, local_path, remote_path,
-                  timeout=300):
+                  timeout=600):
     """
     Copy files to a remote host (guest).
 
@@ -596,14 +616,14 @@ def scp_to_remote(host, port, username, password, local_path, remote_path,
 
     @return: True on success and False on failure.
     """
-    command = ("scp -o UserKnownHostsFile=/dev/null "
+    command = ("scp -v -o UserKnownHostsFile=/dev/null "
                "-o PreferredAuthentications=password -r -P %s %s %...@%s:%s" %
                (port, local_path, username, host, remote_path))
     return remote_scp(command, password, timeout)
 
 
 def scp_from_remote(host, port, username, password, remote_path, local_path,
-                    timeout=300):
+                    timeout=600):
     """
     Copy files from a remote host (guest).
 
@@ -617,7 +637,7 @@ def scp_from_remote(host, port, username, password, remote_path, local_path,
 
     @return: True on success and False on failure.
     """
-    command = ("scp -o UserKnownHostsFile=/dev/null "
+    command = ("scp -v -o UserKnownHostsFile=/dev/null "
                "-o PreferredAuthentications=password -r -P %s %...@%s:%s %s" %
                (port, username, host, remote_path, local_path))
     return remote_scp(command, password, timeout)
diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 6bc7987..d1e0246 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -808,7 +808,7 @@ class VM:
 return session
 
 
-def copy_file

Re: [Autotest] [PATCH 1/3] KVM test: Use customized command to get the version of kvm and its userspace

2010-05-07 Thread Jason Wang

Lucas Meneghel Rodrigues wrote:

On Mon, Apr 26, 2010 at 7:07 AM, Jason Wang  wrote:

The current method may or may not work on various kinds of
distributions, so this patch enables the use of customized
commands to get the version of kvm and its userspace: "kvm_ver_cmd" is
used for the kvm version and "kvm_userspace_ver_cmd" for its userspace.



The method we are currently using is pretty satisfactory - if we fail
in getting /sys/module/kvm/version we use the kernel version as a
fallback, which is good for the kernel module. For qemu, we make a
regular expression searching for numbers following the string version,
so I don't see a reason why we should make it configurable. Care to
provide an example of a situation where the current method fails?

  

The current method may not be as accurate as expected.
On my Fedora box, the output of qemu-kvm -h | head -n 1 is something like:
QEMU PC emulator version 0.9.1 (kvm-83-maint-snapshot-20090205), Copyright (c) 2003-2008 Fabrice Bellard

but rpm -qa may report a more accurate version:
qemu-kvm-0.11.0-13.fc12.x86_64

Signed-off-by: Jason Wang 
---
 client/tests/kvm/kvm_preprocessing.py |   18 +-
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py
index 4b9290c..16200ab 100644
--- a/client/tests/kvm/kvm_preprocessing.py
+++ b/client/tests/kvm/kvm_preprocessing.py
@@ -225,10 +225,10 @@ def preprocess(test, params, env):
     # Get the KVM kernel module version and write it as a keyval
     logging.debug("Fetching KVM module version...")
     if os.path.exists("/dev/kvm"):
-        try:
-            kvm_version = open("/sys/module/kvm/version").read().strip()
-        except:
-            kvm_version = os.uname()[2]
+        kvm_ver_cmd = params.get("kvm_ver_cmd", "cat /sys/module/kvm/version")
+        s, kvm_version = commands.getstatusoutput(kvm_ver_cmd)
+        if s != 0:
+            kvm_version = "Unknown"
     else:
         kvm_version = "Unknown"
         logging.debug("KVM module not loaded")
@@ -239,11 +239,11 @@ def preprocess(test, params, env):
     logging.debug("Fetching KVM userspace version...")
     qemu_path = kvm_utils.get_path(test.bindir, params.get("qemu_binary",
                                                            "qemu"))
-    version_line = commands.getoutput("%s -help | head -n 1" % qemu_path)
-    matches = re.findall("[Vv]ersion .*?,", version_line)
-    if matches:
-        kvm_userspace_version = " ".join(matches[0].split()[1:]).strip(",")
-    else:
+    def_qemu_ver_cmd = "%s -help | head -n 1 | awk '{ print $5}'" % qemu_path
+    kvm_userspace_ver_cmd = params.get("kvm_userspace_ver_cmd",
+                                       def_qemu_ver_cmd)
+    s, kvm_userspace_version = commands.getstatusoutput(kvm_userspace_ver_cmd)
+    if s != 0:
         kvm_userspace_version = "Unknown"
         logging.debug("Could not fetch KVM userspace version")
     logging.debug("KVM userspace version: %s" % kvm_userspace_version)

___
Autotest mailing list
autot...@test.kernel.org
http://test.kernel.org/cgi-bin/mailman/listinfo/autotest
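The patch replaces the built-in parsing with a shell command taken from the test parameters (`kvm_ver_cmd`, `kvm_userspace_ver_cmd`) and reports "Unknown" when that command exits with a non-zero status. A rough POSIX C equivalent of `commands.getstatusoutput()` plus that fallback is sketched below using `popen()` and `pclose()`; `run_version_cmd` is a hypothetical name, and unlike `getstatusoutput()` it keeps only the first line of output.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

/* Run cmd through the shell, keep the first line of its output, and
 * report "Unknown" when the command fails, as the patch does. */
static void run_version_cmd(const char *cmd, char *out, size_t outlen)
{
    FILE *fp = popen(cmd, "r");
    if (fp == NULL) {
        snprintf(out, outlen, "Unknown");
        return;
    }
    if (fgets(out, (int)outlen, fp) == NULL)
        out[0] = '\0';
    out[strcspn(out, "\n")] = '\0';          /* drop the trailing newline */

    /* Drain the rest so the child does not block on a full pipe. */
    char scratch[256];
    while (fgets(scratch, sizeof scratch, fp) != NULL)
        ;

    int status = pclose(fp);
    if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        snprintf(out, outlen, "Unknown");
}
```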






  




[RFC][PATCH v5 01/19] Add a new structure for skb buffer from external.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/skbuff.h |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@ struct skb_shared_info {
void *  destructor_arg;
 };
 
+/* The structure is for a skb which skb->data may point to
+ * an external buffer, which is not allocated from kernel space.
+ * Since the buffer is external, then the shinfo or frags are
+ * also extern too. It also contains a destructor for itself.
+ */
+struct skb_external_page {
+   u8  *start;
+   int size;
+   struct skb_frag_struct *frags;
+   struct skb_shared_info *ushinfo;
+   void(*dtor)(struct skb_external_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4
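The context lines at the end of this hunk describe how `dataref` is split into two 16-bit halves. Patch 11/19 relies on the same counter in `skb_release_data()`, which subtracts `(1 << SKB_DATAREF_SHIFT) + 1` for a header-less (`nohdr`) skb and 1 otherwise. The userspace C sketch below shows that arithmetic; `payload_refs`, `total_refs` and `release_ref` are hypothetical helpers, it assumes (as in kernels of this era) that `skb_header_release()` adds `1 << SKB_DATAREF_SHIFT` when it sets `nohdr`, and it ignores the fast path in which an uncloned skb never touches the counter.

```c
#include <stdbool.h>

#define SKB_DATAREF_SHIFT 16
#define SKB_DATAREF_MASK  ((1 << SKB_DATAREF_SHIFT) - 1)

/* Upper half: references to the payload only. */
static int payload_refs(int dataref) { return dataref >> SKB_DATAREF_SHIFT; }

/* Lower half: references to the whole of skb->data. */
static int total_refs(int dataref) { return dataref & SKB_DATAREF_MASK; }

/* Drop one reference the way skb_release_data() does; returns true when
 * the count reaches zero and the data would be freed. */
static bool release_ref(int *dataref, bool nohdr)
{
    *dataref -= nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1;
    return *dataref == 0;
}
```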



[RFC][PATCH v5 03/19] Export 2 func for device to assign/deassign new strucure

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/netdevice.h |3 +++
 net/core/dev.c|   28 
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bae725c..efb575a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t   napi_frags_finish(struct napi_struct *napi,
  gro_result_t ret);
 extern struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 extern gro_result_tnapi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..ecbb6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
 }
 
+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+   struct mpassthru_port *port)
+{
+   /* locked by mp_mutex */
+   if (rcu_dereference(dev->mp_port))
+   return -EBUSY;
+
+   rcu_assign_pointer(dev->mp_port, port);
+
+   return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+   /* locked by mp_mutex */
+   if (!rcu_dereference(dev->mp_port))
+   return;
+
+   rcu_assign_pointer(dev->mp_port, NULL);
+   synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
-- 
1.5.4.4
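The two helpers publish and clear `dev->mp_port` while relying on the caller to hold `mp_mutex`, so writers are serialized and readers only need `rcu_dereference()`. RCU has no direct userspace equivalent, so the following sketch approximates the pattern with C11 atomics; `struct device`, `port_attach` and `port_detach` are hypothetical names, and the grace period that `synchronize_rcu()` provides in the kernel version is deliberately not reproduced.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct port;   /* opaque; stands in for struct mpassthru_port */

/* Stand-in for net_device: readers load mp_port without taking a lock. */
struct device {
    _Atomic(struct port *) mp_port;
};

/* Caller must serialize attach/detach (the patch relies on mp_mutex). */
static int port_attach(struct device *dev, struct port *port)
{
    if (atomic_load(&dev->mp_port) != NULL)
        return -EBUSY;                      /* already bound to a port */
    atomic_store(&dev->mp_port, port);      /* publish, like rcu_assign_pointer() */
    return 0;
}

static void port_detach(struct device *dev)
{
    if (atomic_load(&dev->mp_port) == NULL)
        return;
    atomic_store(&dev->mp_port, NULL);
    /* The kernel version then calls synchronize_rcu() so that no reader
     * still uses the old port; this sketch has no equivalent. */
}
```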



[RFC][PATCH v5 04/19] Add a ndo_mp_port_prep pointer to net_device_ops.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

If a driver wants to allocate external buffers,
it can export its capabilities, such as the skb
buffer header length and the length of page data it can DMA into.
The owner of the external buffers may make use of this.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/netdevice.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efb575a..183c786 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -707,6 +707,10 @@ struct net_device_ops {
int (*ndo_fcoe_get_wwn)(struct net_device *dev,
u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+   int (*ndo_mp_port_prep)(struct net_device *dev,
+   struct mpassthru_port *port);
+#endif
 };
 
 /*
-- 
1.5.4.4
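Since `ndo_mp_port_prep` is an optional member of `net_device_ops`, any caller has to check it for NULL before calling through it. The body of `netdev_mp_port_prep()` is not included in this excerpt, so the C sketch below only illustrates that general optional-hook pattern; the structure names, the capability fields in `struct port` and the use of `-EOPNOTSUPP` for drivers without the hook are all assumptions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical capability fields a driver might fill in. */
struct port { int hdr_len; int data_len; };
struct dev;

struct dev_ops {
    /* Optional: only drivers that can DMA into external buffers set it. */
    int (*mp_port_prep)(struct dev *dev, struct port *port);
};

struct dev { const struct dev_ops *ops; };

static int dev_mp_port_prep(struct dev *dev, struct port *port)
{
    if (dev->ops == NULL || dev->ops->mp_port_prep == NULL)
        return -EOPNOTSUPP;                 /* driver lacks the hook */
    return dev->ops->mp_port_prep(dev, port);
}
```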



[RFC][PATCH v5 06/19] Add a function to indicate if device use external buffer.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/netdevice.h |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31d9c4a..0cb78f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern void netdev_mp_port_detach(struct net_device *dev);
 extern int netdev_mp_port_prep(struct net_device *dev,
struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+   return (dev && dev->mp_port);
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
kfree_skb(napi->skb);
-- 
1.5.4.4



[RFC][PATCH v5 10/19] Don't do skb recycle, if device use external buffer.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 net/core/skbuff.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ae223d2..169f22c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -553,6 +553,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
if (skb_shared(skb) || skb_cloned(skb))
return 0;
 
+   /* if the device wants to do mediate passthru, the skb may
+* get external buffer, so don't recycle
+*/
+   if (dev_is_mpassthru(skb->dev))
+   return 0;
+
skb_release_head_state(skb);
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
-- 
1.5.4.4



[RFC][PATCH v5 11/19] Use callback to deal with skb_release_data() specially.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

If the buffer is external, use its destructor callback to
release it.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 net/core/skbuff.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 169f22c..5d93b2d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_release_data(struct sk_buff *skb)
 {
+   /* check if the skb has external buffers, we have use destructor_arg
+* here to indicate
+*/
+   struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+
if (!skb->cloned ||
!atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
   &skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
if (skb_has_frags(skb))
skb_drop_fraglist(skb);
 
+   /* if the skb has external buffers, use destructor here,
+* since after that skb->head will be kfree, in case skb->head
+* from external buffer cannot use kfree to destroy.
+*/
+   if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+   ext_page->dtor(ext_page);
kfree(skb->head);
}
 }
-- 
1.5.4.4
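The ordering in the patched `skb_release_data()` is: hand the external pages back through `ext_page->dtor`, then free the kernel-allocated `skb->head`. The userspace sketch below reproduces only that ordering; `struct ext_page` mirrors the fields of `struct skb_external_page` from patch 01/19, `release_data` is a hypothetical name, and the `dev_is_mpassthru(skb->dev)` condition and the reference counting are left out.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct skb_external_page: the pages belong to
 * an external owner, so the struct carries its own destructor. */
struct ext_page {
    unsigned char *start;
    int size;
    void (*dtor)(struct ext_page *);
};

/* Return the external pages to their owner through dtor, then free the
 * locally allocated head, as the patch does with kfree(skb->head). */
static void release_data(unsigned char *head, struct ext_page *ext)
{
    if (ext != NULL && ext->dtor != NULL)
        ext->dtor(ext);
    free(head);
}
```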



[RFC][PATCH v5 12/19] Add a hook to intercept external buffers from NIC driver.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

The hook is called in netif_receive_skb().

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 net/core/dev.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 37b389a..dc2f225 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2548,6 +2548,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+  struct packet_type **pt_prev,
+  int *ret,
+  struct net_device *orig_dev)
+{
+   struct mpassthru_port *mp_port = NULL;
+   struct sock *sk = NULL;
+
+   if (!dev_is_mpassthru(skb->dev))
+   return skb;
+   mp_port = skb->dev->mp_port;
+
+   if (*pt_prev) {
+   *ret = deliver_skb(skb, *pt_prev, orig_dev);
+   *pt_prev = NULL;
+   }
+
+   sk = mp_port->sock->sk;
+   skb_queue_tail(&sk->sk_receive_queue, skb);
+   sk->sk_state_change(sk);
+
+   return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
@@ -2629,6 +2660,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+   /* To intercept mediate passthru(zero-copy) packets here */
+   skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+   if (!skb)
+   goto out;
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
goto out;
-- 
1.5.4.4
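`handle_mpassthru()` follows the same convention as `handle_bridge()`: a hook either returns the skb so processing continues, or returns NULL once it has taken ownership, in which case `netif_receive_skb()` jumps to `out`. The simplified C sketch below shows that convention with a hypothetical `struct pkt` and a plain linked-list queue standing in for the socket receive queue; delivery to pending taps (`pt_prev`) and the `sk_state_change()` wake-up are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for sk_buff and a socket queue. */
struct pkt {
    bool dev_is_mpassthru;
    struct pkt *next;
};

struct queue { struct pkt *head, *tail; int len; };

static void queue_tail(struct queue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
}

/* Like handle_mpassthru(): consume the packet (return NULL) if its device
 * is in mediate-passthru mode, otherwise hand it back unchanged. */
static struct pkt *hook_mpassthru(struct pkt *p, struct queue *sock_q)
{
    if (!p->dev_is_mpassthru)
        return p;
    queue_tail(sock_q, p);
    return NULL;
}

/* Receive path: returns true if the packet reached normal delivery. */
static bool receive(struct pkt *p, struct queue *sock_q)
{
    p = hook_mpassthru(p, sock_q);
    if (p == NULL)
        return false;
    /* handle_bridge() and the other hooks would follow here */
    return true;
}
```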



[RFC][PATCH v5 13/19] To skip GRO if buffer is external currently.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 net/core/dev.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index dc2f225..6c6b2fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2787,6 +2787,10 @@ enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
if (skb_is_gso(skb) || skb_has_frags(skb))
goto normal;
 
+   /* currently GRO is not supported by mediate passthru */
+   if (dev_is_mpassthru(skb->dev))
+   goto normal;
+
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
if (ptype->type != type || ptype->dev || !ptype->gro_receive)
-- 
1.5.4.4



[RFC][PATCH v5 15/19] Add basic funcs and ioctl to mp device.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

The ioctl is used by the mp device to bind an underlying
NIC; it queries the hardware capabilities and declares that
the NIC uses external buffers.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---

memory leak fixed,
kconfig made,
do_unbind() made,
mp_chr_ioctl() cleanup

by Jeff Dike 
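The file below sizes two constants from the cache-line size: `COPY_THRESHOLD` is four cache lines and `COPY_HDR_LEN` is one cache line but never less than 64 bytes. The code that consumes them is not part of this excerpt, so their exact role is not shown here; the sketch below just evaluates the two macros as functions of a hypothetical `l1_cache_bytes` parameter, giving 128/64 for 32-byte lines, 256/64 for 64-byte lines and 512/128 for 128-byte lines.

```c
/* COPY_THRESHOLD: (L1_CACHE_BYTES * 4) */
static int copy_threshold(int l1_cache_bytes)
{
    return l1_cache_bytes * 4;
}

/* COPY_HDR_LEN: (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES) */
static int copy_hdr_len(int l1_cache_bytes)
{
    return l1_cache_bytes < 64 ? 64 : l1_cache_bytes;
}
```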

 drivers/vhost/mpassthru.c |  682 +
 1 files changed, 682 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..33cc123
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,682 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME"mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_info {
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the external buffers
+*/
+   int total;
+   int offset;
+   struct page *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff  *skb;
+   struct page_ctor*ctor;
+
+   /* The pointer relayed to skb, to indicate
+* it's a external allocated skb or kernel
+*/
+   struct skb_external_pageext_page;
+   struct skb_shared_info  ushinfo;
+
+#define INFO_READ  0
+#define INFO_WRITE 1
+   unsignedflags;
+   unsignedpnum;
+
+   /* It's meaningful for receive, means
+* the max length allowed
+*/
+   size_t  len;
+
+   /* The fields after that is for backend
+* driver, now for vhost-net.
+*/
+
+   struct kiocb*iocb;
+   unsigned intdesc_pos;
+   struct iovechdr[MAX_SKB_FRAGS + 2];
+   struct ioveciov[MAX_SKB_FRAGS + 2];
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int wq_len;
+   int rq_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   /* record the locked pages */
+   int lock_pages;
+   struct rlimit   o_rlim;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct mp_struct {
+   struct mp_file  *mfile;
+   struct net_device   *dev;
+   struct page_ctor*ctor;
+   struct socket   socket;
+
+#ifdef MPASSTHRU_DEBUG
+   int debug;
+#endif
+};
+
+struct mp_file {
+   atomic_t count;
+   struct mp_struct *mp;
+   struct net *net;
+};
+
+struct mp_sock {
+   struct sock sk;
+   struct mp_struct*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+   int ret = 0;
+
+   rtnl_lock();
+   ret = dev_change_flags(dev, flags);
+   rtnl_unlock();
+
+   if (ret < 0)
+   printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+   return ret;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+   int rc;
+   struct page_ctor *ctor;
+   struct net_device *dev = mp->dev;
+
+   /* locked by mp_mutex */
+   if (rcu_dereference(mp->ctor))
+   return -EBUSY;
+
+

[RFC][PATCH v5 19/19] Provides multiple submits and asynchronous notifications.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

The vhost-net backend currently supports only synchronous send/recv
operations. This patch adds multiple submits and asynchronous
notifications, which are needed for the zero-copy case.
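The asynchronous path in this patch is a producer/consumer queue: `handle_iocb()` appends a finished `kiocb` to `vq->notifier` under `notify_lock`, and the `handle_async_*_events_notify()` functions dequeue entries, report them as used, and stop once `VHOST_NET_WEIGHT` bytes have been processed so that the work is requeued instead of monopolizing the worker. A single-threaded C sketch of that flow follows; `struct completion`, `drain` and the byte-budget parameter are hypothetical, and the locking is only described in comments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion record standing in for struct kiocb. */
struct completion {
    unsigned head;               /* descriptor index (ki_pos in the patch) */
    int nbytes;                  /* bytes transferred (ki_nbytes) */
    struct completion *next;
};

struct notifier { struct completion *first, *last; };

/* Producer side, like handle_iocb(); the kernel holds vq->notify_lock. */
static void notify_enqueue(struct notifier *n, struct completion *c)
{
    c->next = NULL;
    if (n->last)
        n->last->next = c;
    else
        n->first = c;
    n->last = c;
}

/* Consumer side, like notify_dequeue(): oldest completion first. */
static struct completion *notify_dequeue(struct notifier *n)
{
    struct completion *c = n->first;
    if (c) {
        n->first = c->next;
        if (!n->first)
            n->last = NULL;
    }
    return c;
}

/* Drain completions until the byte budget is reached. Returns true if the
 * caller should requeue itself (the patch calls vhost_poll_queue()). */
static bool drain(struct notifier *n, int budget,
                  void (*used)(unsigned head, int len))
{
    struct completion *c;
    int total = 0;

    while ((c = notify_dequeue(n)) != NULL) {
        used(c->head, c->nbytes);    /* vhost_add_used_and_signal() */
        total += c->nbytes;
        if (total >= budget)
            return true;
    }
    return false;
}
```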

Signed-off-by: Xin Xiaohui 
---
 drivers/vhost/net.c   |  240 +++-
 drivers/vhost/vhost.c |  120 ++---
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 318 insertions(+), 56 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..b3171ed 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -49,6 +51,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -93,11 +96,138 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   if (!list_empty(&vq->notifier)) {
+   iocb = list_first_entry(&vq->notifier,
+   struct kiocb, ki_list);
+   list_del(&iocb->ki_list);
+   }
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+   return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+   struct vhost_virtqueue *vq = iocb->private;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   list_add_tail(&iocb->ki_list, &vq->notifier);
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+   return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ struct socket *sock)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   unsigned int head, log, in, out;
+   int size;
+
+   if (!is_async_vq(vq))
+   return;
+
+   if (sock->sk->sk_data_ready)
+   sock->sk->sk_data_ready(sock->sk, 0);
+
+   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+   vq->log : NULL;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, iocb->ki_nbytes);
+   size = iocb->ki_nbytes;
+   head = iocb->ki_pos;
+   rx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+   kmem_cache_free(net->cache, iocb);
+
+   /* when log is enabled, recomputing the log info is needed,
+* since these buffers are in async queue, and may not get
+* the log info before.
+*/
+   if (unlikely(vq_log)) {
+   if (!log)
+   __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in, vq_log,
+   &log, head);
+   vhost_log_write(vq, vq_log, log, size);
+   }
+   if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   int tx_total_len = 0;
+
+   if (!is_async_vq(vq))
+   return;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, 0);
+   tx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+
+   kmem_cache_free(net->cache, iocb);
+   if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+struct vhost_virtqueue *vq,
+unsigned head)
+{
+   struct kiocb *iocb = NULL;
+
+   if (!is_async_vq(vq))

[RFC][PATCH v5 18/19] Add a kconfig entry and make entry for mp device.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Reviewed-by: Jeff Dike 
---
 drivers/vhost/Kconfig  |   10 ++
 drivers/vhost/Makefile |2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+   tristate "mediate passthru network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ Zero-copy network I/O support; we call it mediate passthru to
+ distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.5.4.4



[RFC][PATCH v5 17/19] Export proto_ops to vhost-net driver.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Currently, vhost-net is the only user of the mp device.
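For skbs built on external buffers, `mp_sock_data_ready()` checks two things before completing a receive: the packet must fit in the guest buffer (`skb->len > info->len` is counted in `rx_dropped` as a truncated packet), and the guest must have posted at least as many fragments as the host skb uses. It then zeroes every guest fragment size and copies the host sizes over. The C sketch below mirrors those checks; `struct frag_info`, `fill_guest_frags` and `MAX_FRAGS` are hypothetical stand-ins (the kernel uses `MAX_SKB_FRAGS`).

```c
#include <stdbool.h>

#define MAX_FRAGS 18   /* hypothetical placeholder for MAX_SKB_FRAGS */

struct frag_info {
    int nr_frags;
    unsigned sizes[MAX_FRAGS];
};

/* Reject packets larger than the guest buffer or spread over more
 * fragments than the guest posted; otherwise report the received size
 * of each fragment back to the guest and zero the unused ones. */
static bool fill_guest_frags(struct frag_info *guest,
                             const struct frag_info *host,
                             unsigned pkt_len, unsigned guest_len)
{
    if (pkt_len > guest_len)
        return false;                 /* truncated: counted as rx_dropped */
    if (guest->nr_frags < host->nr_frags)
        return false;
    for (int i = 0; i < guest->nr_frags; i++)
        guest->sizes[i] = 0;
    for (int i = 0; i < host->nr_frags; i++)
        guest->sizes[i] = host->sizes[i];
    return true;
}
```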

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 drivers/vhost/mpassthru.c |  322 -
 1 files changed, 318 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 8538a87..96b314a 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -577,8 +577,322 @@ failed:
return NULL;
 }
 
+static void mp_sock_destruct(struct sock *sk)
+{
+   struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+   kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+   if (sk_has_sleeper(sk))
+   wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+   if (sk_has_sleeper(sk))
+   wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+   struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+   struct page_ctor *ctor = NULL;
+   struct sk_buff *skb = NULL;
+   struct page_info *info = NULL;
+   struct ethhdr *eth;
+   struct kiocb *iocb = NULL;
+   int len, i;
+
+   struct virtio_net_hdr hdr = {
+   .flags = 0,
+   .gso_type = VIRTIO_NET_HDR_GSO_NONE
+   };
+
+   ctor = rcu_dereference(mp->ctor);
+   if (!ctor)
+   return;
+
+   while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+   if (skb_shinfo(skb)->destructor_arg) {
+   info = container_of(skb_shinfo(skb)->destructor_arg,
+   struct page_info, ext_page);
+   info->skb = skb;
+   if (skb->len > info->len) {
+   mp->dev->stats.rx_dropped++;
+   DBG(KERN_INFO "Discarded truncated rx packet: "
+   " len %d > %zd\n", skb->len, info->len);
+   info->total = skb->len;
+   goto clean;
+   } else {
+   int i;
+   struct skb_shared_info *gshinfo =
+   (struct skb_shared_info *)
+   (&info->ushinfo);
+   struct skb_shared_info *hshinfo =
+   skb_shinfo(skb);
+
+   if (gshinfo->nr_frags < hshinfo->nr_frags)
+   goto clean;
+   eth = eth_hdr(skb);
+   skb_push(skb, ETH_HLEN);
+
+   hdr.hdr_len = skb_headlen(skb);
+   info->total = skb->len;
+
+   for (i = 0; i < gshinfo->nr_frags; i++)
+   gshinfo->frags[i].size = 0;
+   for (i = 0; i < hshinfo->nr_frags; i++)
+   gshinfo->frags[i].size =
+   hshinfo->frags[i].size;
+   }
+   } else {
+   /* The skb composed with kernel buffers
+* in case external buffers are not sufficent.
+* The case should be rare.
+*/
+   unsigned long flags;
+   int i;
+   struct skb_shared_info *gshinfo = NULL;
+
+   info = NULL;
+
+   spin_lock_irqsave(&ctor->read_lock, flags);
+   if (!list_empty(&ctor->readq)) {
+   info = list_first_entry(&ctor->readq,
+   struct page_info, list);
+   list_del(&info->list);
+   }
+   spin_unlock_irqrestore(&ctor->read_lock, flags);
+   if (!info) {
+   DBG(KERN_INFO
+   "No external buffer avaliable %p\n",
+   skb);
+   skb_queue_head(&sk->sk_receive_queue,
+   skb);
+   break;
+   }
+   info->skb = skb;
+   /* compute the guest skb frags info */
+   gshinfo = (struct skb_shared_info *)
+ (info->ext_page.start +
+ SKB_DATA_ALIGN(info->ext_page.size));
+
+   if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
+   goto clean;
+
+  

[RFC][PATCH v5 16/19] Manipulate external buffers in mp device.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Where the external buffers come from, and how to destroy them.
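In `page_ctor()` below, each fragment of a recycled `page_info` points at one of the pinned user pages: only the first fragment starts at the buffer's in-page offset, and every fragment is sized `PAGE_SIZE` unless the port uses a single page, in which case it gets `port->data_len`. The C sketch below reproduces just that layout rule; `layout_frags`, `struct frag` and the 4 KiB `PAGE_SIZE_BYTES` are assumptions for illustration.

```c
#define PAGE_SIZE_BYTES 4096u   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct skb_frag_struct without the page pointer. */
struct frag {
    unsigned page_offset;
    unsigned size;
};

/* Only fragment 0 starts at the user buffer's offset within its page;
 * every fragment is a full page unless the port uses a single page,
 * which gets data_len instead. */
static void layout_frags(struct frag *frags, unsigned pnum,
                         unsigned first_offset, unsigned npages,
                         unsigned data_len)
{
    for (unsigned i = 0; i < pnum; i++) {
        frags[i].page_offset = i ? 0 : first_offset;
        frags[i].size = npages > 1 ? PAGE_SIZE_BYTES : data_len;
    }
}
```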

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 drivers/vhost/mpassthru.c |  254 -
 1 files changed, 251 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 33cc123..8538a87 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -160,6 +160,39 @@ static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
return ret;
 }
 
+/* The main function to allocate external buffers */
+static struct skb_external_page *page_ctor(struct mpassthru_port *port,
+   struct sk_buff *skb, int npages)
+{
+   int i;
+   unsigned long flags;
+   struct page_ctor *ctor;
+   struct page_info *info = NULL;
+
+   ctor = container_of(port, struct page_ctor, port);
+
+   spin_lock_irqsave(&ctor->read_lock, flags);
+   if (!list_empty(&ctor->readq)) {
+   info = list_first_entry(&ctor->readq, struct page_info, list);
+   list_del(&info->list);
+   }
+   spin_unlock_irqrestore(&ctor->read_lock, flags);
+   if (!info)
+   return NULL;
+
+   for (i = 0; i < info->pnum; i++) {
+   get_page(info->pages[i]);
+   info->frag[i].page = info->pages[i];
+   info->frag[i].page_offset = i ? 0 : info->offset;
+   info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+   port->data_len;
+   }
+   info->skb = skb;
+   info->ext_page.frags = info->frag;
+   info->ext_page.ushinfo = &info->ushinfo;
+   return &info->ext_page;
+}
+
 static int page_ctor_attach(struct mp_struct *mp)
 {
int rc;
@@ -192,7 +225,7 @@ static int page_ctor_attach(struct mp_struct *mp)
 
dev_hold(dev);
ctor->dev = dev;
-   ctor->port.ctor = NULL;
+   ctor->port.ctor = page_ctor;
ctor->port.sock = &mp->socket;
ctor->lock_pages = 0;
rc = netdev_mp_port_attach(dev, &ctor->port);
@@ -260,11 +293,66 @@ static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
return 0;
 }
 
+static void relinquish_resource(struct page_ctor *ctor)
+{
+   if (!(ctor->dev->flags & IFF_UP) &&
+   !(ctor->wq_len + ctor->rq_len))
+   kmem_cache_destroy(ctor->cache);
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+   struct page_info *info = (struct page_info *)(iocb->private);
+   int i;
+
+   if (info->flags == INFO_READ) {
+   for (i = 0; i < info->pnum; i++) {
+   if (info->pages[i]) {
+   set_page_dirty_lock(info->pages[i]);
+   put_page(info->pages[i]);
+   }
+   }
+   info->skb->destructor = NULL;
+   kfree_skb(info->skb);
+   info->ctor->rq_len--;
+   } else
+   info->ctor->wq_len--;
+   /* Decrement the number of locked pages */
+   info->ctor->lock_pages -= info->pnum;
+   kmem_cache_free(info->ctor->cache, info);
+   relinquish_resource(info->ctor);
+
+   return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+   struct kiocb *iocb = NULL;
+
+   iocb = info->iocb;
+   if (!iocb)
+   return iocb;
+   iocb->ki_flags = 0;
+   iocb->ki_users = 1;
+   iocb->ki_key = 0;
+   iocb->ki_ctx = NULL;
+   iocb->ki_cancel = NULL;
+   iocb->ki_retry = NULL;
+   iocb->ki_iovec = NULL;
+   iocb->ki_eventfd = NULL;
+   iocb->ki_pos = info->desc_pos;
+   iocb->ki_nbytes = size;
+   iocb->ki_dtor(iocb);
+   iocb->private = (void *)info;
+   iocb->ki_dtor = mp_ki_dtor;
+
+   return iocb;
+}
+
 static int page_ctor_detach(struct mp_struct *mp)
 {
struct page_ctor *ctor;
struct page_info *info;
-   struct kiocb *iocb = NULL;
int i;
 
/* locked by mp_mutex */
@@ -276,12 +364,17 @@ static int page_ctor_detach(struct mp_struct *mp)
for (i = 0; i < info->pnum; i++)
if (info->pages[i])
put_page(info->pages[i]);
+   create_iocb(info, 0);
+   ctor->rq_len--;
kmem_cache_free(ctor->cache, info);
}
+
+   relinquish_resource(ctor);
+
set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
   ctor->o_rlim.rlim_cur,
   ctor->o_rlim.rlim_max);
-   kmem_cache_destroy(ctor->cache);
+
netdev_mp_port_detach(ctor->dev);
dev_put(ctor->dev);
 
@@ -329,6 +422,161 @@ static void mp_put(struct mp_file *mfile)
mp_detach(mfile->mp);
 }
 
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_external_page *ext_page)
+{
+   struct page_info 

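Both `page_ctor()` and the receive path take the first `page_info` off `ctor->readq` under `read_lock` and return NULL when the queue is empty, leaving the caller to fall back. A userspace sketch of that pop-under-lock pattern, with a pthread mutex standing in for the spinlock and a hypothetical singly linked `buf_queue` in place of the kernel `list_head`:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_info. */
struct buf_info {
	int id;
	struct buf_info *next;
};

/* Hypothetical stand-in for ctor->readq protected by ctor->read_lock. */
struct buf_queue {
	pthread_mutex_t lock;
	struct buf_info *head;
	struct buf_info *tail;
};

#define BUF_QUEUE_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, NULL }

/* Append a buffer posted by the guest side. */
static void buf_queue_push(struct buf_queue *q, struct buf_info *b)
{
	assert(b != NULL);
	b->next = NULL;
	pthread_mutex_lock(&q->lock);
	if (q->tail)
		q->tail->next = b;
	else
		q->head = b;
	q->tail = b;
	pthread_mutex_unlock(&q->lock);
}

/* Detach the first buffer, or return NULL so the caller can fall back. */
static struct buf_info *buf_queue_pop(struct buf_queue *q)
{
	struct buf_info *b;

	pthread_mutex_lock(&q->lock);
	b = q->head;
	if (b) {
		q->head = b->next;
		if (!q->head)
			q->tail = NULL;
	}
	pthread_mutex_unlock(&q->lock);
	return b;
}
```

As in the patch, the lock covers only the list manipulation; the per-page work (taking references, filling frags) happens after the entry has been detached.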
[RFC][PATCH v5 14/19] Add header file for mp device.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/mpassthru.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include 
+#include 
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV  _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include 
+#include 
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4


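The two ioctl numbers in this header use the standard Linux `_IOW`/`_IO` encoding with magic `'M'`. A small sketch that re-declares them in userspace and decodes the fields with the `_IOC_*` helpers pulled in by `<sys/ioctl.h>` on glibc; the values are copied from the header, but nothing is assumed about how the mp device interprets the argument:

```c
#include <assert.h>
#include <sys/ioctl.h>

/* Copied from include/linux/mpassthru.h in this patch. */
#define MPASSTHRU_BINDDEV   _IOW('M', 213, int)
#define MPASSTHRU_UNBINDDEV _IO('M', 214)

/* Returns 1 if cmd carries magic 'M' and the expected command number. */
static int is_mp_cmd(unsigned int cmd, unsigned int nr)
{
	return _IOC_TYPE(cmd) == 'M' && _IOC_NR(cmd) == nr;
}
```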

[RFC][PATCH v5 09/19] Ignore room skb_reserve() when device is using external buffer.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

To keep skb->data and skb->head consistent when they come from an
external buffer, ignore the room the driver reserves for a
kernel-allocated skb.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/skbuff.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ff8c27..193b259 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1200,6 +1200,15 @@ static inline int skb_tailroom(const struct sk_buff *skb)
  */
 static inline void skb_reserve(struct sk_buff *skb, int len)
 {
+   /* skb_reserve() is only called on an empty buffer. When the
+* skb uses an external buffer, we cannot guarantee that the
+* external buffer has the same reserved header space as a
+* kernel-allocated skb, so the reservation must be ignored.
+* The external buffer info is recorded in the destructor_arg
+* field, so use it as the indicator.
+*/
+   if (skb_shinfo(skb)->destructor_arg)
+   return;
skb->data += len;
skb->tail += len;
 }
-- 
1.5.4.4


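This patch makes `skb_reserve()` a no-op whenever `skb_shinfo(skb)->destructor_arg` marks the skb as backed by an external buffer. A toy model of that rule, using a hypothetical `toy_skb` whose `external` field plays the role of `destructor_arg` (this is not the real `struct sk_buff`):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified skb: offsets instead of pointers. */
struct toy_skb {
	size_t data;    /* offset of skb->data from the buffer start */
	size_t tail;    /* offset of skb->tail from the buffer start */
	void *external; /* non-NULL when backed by an external buffer */
};

/* Mirrors the patched skb_reserve(): only kernel buffers get headroom. */
static void toy_skb_reserve(struct toy_skb *skb, size_t len)
{
	assert(skb->data == skb->tail); /* meant for empty buffers only */
	if (skb->external)
		return;
	skb->data += len;
	skb->tail += len;
}
```

For an external buffer the driver's requested headroom is silently dropped, so `skb->data` stays where the external buffer placed it.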

[RFC][PATCH v5 07/19] Add interface to get external buffers.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Currently, external buffers can only be obtained from the mp device.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/skbuff.h |   12 
 net/core/skbuff.c  |   16 
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf309c9..281a1c0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1519,6 +1519,18 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
__free_page(page);
 }
 
+extern struct skb_external_page *netdev_alloc_external_pages(
+   struct net_device *dev,
+   struct sk_buff *skb, int npages);
+
+static inline struct skb_external_page *netdev_alloc_external_page(
+   struct net_device *dev,
+   struct sk_buff *skb, unsigned int size)
+{
+   return netdev_alloc_external_pages(dev, skb,
+  DIV_ROUND_UP(size, PAGE_SIZE));
+}
+
 /**
  * skb_clone_writable - is the header of a clone writable
  * @skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..6345acc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -278,6 +278,22 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+struct skb_external_page *netdev_alloc_external_pages(struct net_device *dev,
+   struct sk_buff *skb, int npages)
+{
+   struct mpassthru_port *port;
+   struct skb_external_page *ext_page = NULL;
+
+   port = rcu_dereference(dev->mp_port);
+   if (!port)
+   goto out;
+   BUG_ON(npages > port->npages);
+   ext_page = port->ctor(port, skb, npages);
+out:
+   return ext_page;
+}
+EXPORT_SYMBOL(netdev_alloc_external_pages);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size)
 {
-- 
1.5.4.4


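`netdev_alloc_external_page()` turns a byte size into a page count with `DIV_ROUND_UP(size, PAGE_SIZE)`, and `netdev_alloc_external_pages()` then requires that count not to exceed `port->npages` (it uses `BUG_ON`). A userspace sketch of that arithmetic, assuming 4 KiB pages; `pages_needed()` and `fits_port()` are hypothetical helpers:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096U /* assumption: 4 KiB pages */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of pages needed to hold size bytes. */
static unsigned int pages_needed(unsigned int size)
{
	return DIV_ROUND_UP(size, SKETCH_PAGE_SIZE);
}

/* The patch BUG()s when the request exceeds port->npages; here we report it. */
static int fits_port(unsigned int size, int port_npages)
{
	assert(port_npages > 0);
	return pages_needed(size) <= (unsigned int)port_npages;
}
```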

[RFC][PATCH v5 08/19] Make __alloc_skb() to get external buffer.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Add a dev parameter to __alloc_skb(). When an external buffer is
available, skb->data points to it; recompute skb->head, maintain
the shinfo of the external buffer, and record the external buffer
info in the destructor_arg field.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---

__alloc_skb() cleanup by Jeff Dike

 include/linux/skbuff.h |7 ---
 net/core/skbuff.c  |   43 +--
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 281a1c0..5ff8c27 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -442,17 +442,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int fclone,
+  int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
-   return __alloc_skb(size, priority, 0, -1);
+   return __alloc_skb(size, priority, 0, -1, NULL);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1, -1);
+   return __alloc_skb(size, priority, 1, -1, NULL);
 }
 
 extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6345acc..ae223d2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -161,7 +161,8 @@ EXPORT_SYMBOL(skb_under_panic);
  * @fclone: allocate from fclone cache instead of head cache
  * and allocate a cloned (child) skb
  * @node: numa node to allocate memory on
- *
+ * @dev: the device that owns the skb if the skb tries to get an external
+ * buffer; otherwise NULL.
  * Allocate a new &sk_buff. The returned buffer has no headroom and a
  * tail room of size bytes. The object has a reference count of one.
  * The return is the buffer. On a failure the return is %NULL.
@@ -170,12 +171,13 @@ EXPORT_SYMBOL(skb_under_panic);
  * %GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-   int fclone, int node)
+   int fclone, int node, struct net_device *dev)
 {
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
-   u8 *data;
+   u8 *data = NULL;
+   struct skb_external_page *ext_page = NULL;
 
cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
@@ -185,8 +187,23 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
goto out;
 
size = SKB_DATA_ALIGN(size);
-   data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-   gfp_mask, node);
+
+   /* If the device wants to do mediate passthru(zero-copy),
+* the skb may try to get external buffers from outside.
+* If that fails, fall back to allocating buffers from the kernel.
+*/
+   if (dev && dev->mp_port) {
+   ext_page = netdev_alloc_external_page(dev, skb, size);
+   if (ext_page) {
+   data = ext_page->start;
+   size = ext_page->size;
+   }
+   }
+
+   if (!data)
+   data = kmalloc_node_track_caller(
+   size + sizeof(struct skb_shared_info),
+   gfp_mask, node);
if (!data)
goto nodata;
 
@@ -208,6 +225,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
skb->mac_header = ~0U;
 #endif
 
+   /* If the skb gets external buffers successfully, since the shinfo is
+* at the end of the buffer, we may retain the shinfo once we
+* need it sometime.
+*/
+   if (ext_page) {
+   skb->head = skb->data - NET_IP_ALIGN - NET_SKB_PAD;
+   memcpy(ext_page->ushinfo, skb_shinfo(skb),
+  sizeof(struct skb_shared_info));
+   }
/* make sure we initialize shinfo sequentially */
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
@@ -231,6 +257,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
child->fclone = SKB_FCLONE_UNAVAILABLE;
}
+   /* Record the external buffer info in this field. It's not so good,
+* but we cannot find another place easily.
+*/
+   shinfo->destructor_arg = ext_page;
+
 out:
return skb;
 nodata:
@@ -259,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
int node = dev->dev.parent ? dev_

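The new `__alloc_skb()` logic is: if the device has an `mp_port`, try to take the data area from an external buffer, and if that fails, fall back to a kernel allocation. A userspace sketch of that try-then-fall-back shape, with a hypothetical `ext_provider` callback in place of `netdev_alloc_external_page()` and `malloc()` in place of `kmalloc_node_track_caller()`:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical external-buffer source; get() returns NULL when none is free. */
struct ext_provider {
	void *(*get)(struct ext_provider *p, size_t size);
};

/* Allocate a data area; *from_ext reports which path supplied it. */
static void *alloc_data(struct ext_provider *p, size_t size, int *from_ext)
{
	void *data = NULL;

	assert(from_ext != NULL);
	*from_ext = 0;
	if (p && p->get) {
		data = p->get(p, size);
		if (data)
			*from_ext = 1;
	}
	if (!data)
		data = malloc(size); /* kernel-buffer fallback */
	return data;
}

/* Example providers used to exercise both paths. */
static char ext_pool[256];

static void *empty_get(struct ext_provider *p, size_t size)
{
	(void)p;
	(void)size;
	return NULL;
}

static void *pool_get(struct ext_provider *p, size_t size)
{
	(void)p;
	return size <= sizeof(ext_pool) ? ext_pool : NULL;
}
```

Only buffers that came from the fallback path are owned by the caller and must be freed with `free()`.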
[RFC][PATCH v5 05/19] Add a function make external buffer owner to query capability.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

The external buffer owner can use this function to query
the capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/netdevice.h |2 +
 net/core/dev.c|   51 +
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 183c786..31d9c4a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t   napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_attach(struct net_device *dev,
 struct mpassthru_port *port);
 extern void netdev_mp_port_detach(struct net_device *dev);
+extern int netdev_mp_port_prep(struct net_device *dev,
+   struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index ecbb6b1..37b389a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2497,6 +2497,57 @@ void netdev_mp_port_detach(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_mp_port_detach);
 
+/* To support mediate passthru (zero-copy) with the NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+   struct mpassthru_port *port)
+{
+   int rc;
+   int npages, data_len;
+   const struct net_device_ops *ops = dev->netdev_ops;
+
+   /* needed by packet split */
+
+   if (ops->ndo_mp_port_prep) {
+   rc = ops->ndo_mp_port_prep(dev, port);
+   if (rc)
+   return rc;
+   } else {
+   /* If the NIC driver did not report this,
+* then we try to use default value.
+*/
+   port->hdr_len = 128;
+   port->data_len = 2048;
+   port->npages = 1;
+   }
+
+   if (port->hdr_len <= 0)
+   goto err;
+
+   npages = port->npages;
+   data_len = port->data_len;
+   if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+   (data_len < PAGE_SIZE * (npages - 1) ||
+data_len > PAGE_SIZE * npages))
+   goto err;
+
+   return 0;
+err:
+   dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+   return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
-- 
1.5.4.4



[RFC][PATCH v5 00/19] Provide a zero-copy method on KVM virtio-net.

2010-05-07 Thread xiaohui . xin
We provide a zero-copy method in which the driver side may get external
buffers for DMA. Here "external" means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come
from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host
NIC driver DMA directly into it.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net so that it can
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-13: net core changes.
patch 14-18: new device as the interface to manipulate external buffers.
patch 19:    for vhost-net.

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload in skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We have considered two ways to utilize the page constructor
API to dispense the user buffers.

One:    Modify the __alloc_skb() function a bit so that it only allocates
the sk_buff structure, while the data pointer points to a
user buffer obtained from the page constructor API.
The shinfo of the skb then also comes from the guest.
When a packet is received from hardware, skb->data is filled
directly by h/w. This is the way we have implemented it.

Pros:   We can avoid any copy here.
Cons:   The guest virtio-net driver needs to allocate skbs in almost
the same way as the host NIC drivers, i.e. with the same size as
netdev_alloc_skb() and the same reserved space at the
head of the skb. Many NIC drivers match the guest here and
are fine with this, but some of the latest NIC drivers reserve special
room in the skb head. To deal with that, we suggest providing
a method in the guest virtio-net driver to ask the NIC driver for the
parameters we are interested in once we know which device
we have bound to do zero-copy. Then we ask the guest to do so.


Two:    Modify the driver to get user buffers allocated from a page constructor
API (substituting for alloc_page()); the user buffers are used as payload
buffers and filled by h/w directly when a packet is received. The driver
should associate the pages with the skb (skb_shinfo(skb)->frags). For
the head buffer, the host allocates the skb and h/w fills it.
After that, the data filled into the host skb header is copied into the
guest header buffer, which is submitted together with the payload buffer.

Pros:   We need to care less about how the guest or host allocates their
buffers.
Cons:   We still need a small copy here for the skb header.

We are not sure which way is better. This is the first thing we want
to get comments on from the community. We want the modifications to the
network part to be generic rather than used only by the vhost-net backend:
a user application may use them as well once the zero-copy device provides
async read/write operations later.

We have received comments from Michael. He said the first method would break
the compatibility of the virtio-net driver and may complicate qemu live
migration. Currently, we ignore skb_reserve() if the device
is doing zero-copy, so the guest virtio-net driver does not need to change.
We therefore continue with the first way.
But comments about the two ways are still appreciated.

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. In a simple
test with netperf, we found that both bandwidth and CPU % went up,
but bandwidth increased by a much larger ratio than CPU %.

What we have not done yet:
packet split support
To support GRO
Performance tuning

What we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for mp device to change NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for a missing dev_put on failure
usin

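On the write path described in the cover letter, only the header is copied into `skb->data` and the payload stays in place as page fragments. A userspace sketch of how such a split could be computed, assuming a hypothetical `hdr_len` limit (for example the `hdr_len` a port reports) and a hypothetical `tx_split` result; it only computes offsets and copies the header bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result: header copied, payload referenced in place. */
struct tx_split {
	size_t hdr_copied; /* bytes copied into the linear header area */
	size_t frag_off;   /* payload offset within the user buffer */
	size_t frag_len;   /* payload bytes left in place (zero-copy) */
};

static struct tx_split split_packet(const unsigned char *buf, size_t len,
				    unsigned char *hdr, size_t hdr_len)
{
	struct tx_split s;

	assert(buf != NULL && hdr != NULL);
	s.hdr_copied = len < hdr_len ? len : hdr_len;
	memcpy(hdr, buf, s.hdr_copied); /* the only copy on this path */
	s.frag_off = s.hdr_copied;
	s.frag_len = len - s.hdr_copied;
	return s;
}
```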
[RFC][PATCH v5 02/19] Add a new struct for device to manipulate external buffer.

2010-05-07 Thread xiaohui . xin
From: Xin Xiaohui 

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---
 include/linux/netdevice.h |   19 ++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..bae725c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,22 @@ struct netdev_queue {
unsigned long   tx_dropped;
 } cacheline_aligned_in_smp;
 
+/* Add a structure to struct net_device; the new field is
+ * named mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability of the net device driver,
+ * a socket, and an external buffer creator. External means
+ * the skb buffer belonging to the device may not be allocated
+ * from kernel space.
+ */
+struct mpassthru_port  {
+   int hdr_len;
+   int data_len;
+   int npages;
+   unsigned        flags;
+   struct socket   *sock;
+   struct skb_external_page *(*ctor)(struct mpassthru_port *,
+   struct sk_buff *, int);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +968,8 @@ struct net_device {
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port*garp_port;
-
+   /* mpassthru */
+   struct mpassthru_port   *mp_port;
/* class/net/name entry */
struct device   dev;
/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4


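`struct mpassthru_port` carries a `ctor` callback that receives only the port pointer; the mp device embeds the port in its own `struct page_ctor` and recovers the outer object with `container_of(port, struct page_ctor, port)`. A userspace sketch of that embed-and-recover pattern, with a locally defined `container_of` (the kernel's version also type-checks) and hypothetical simplified structs:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of; the kernel version adds type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct port {                /* stands in for struct mpassthru_port */
	int npages;
	int (*ctor)(struct port *p);
};

struct owner {               /* stands in for struct page_ctor */
	int id;
	struct port port;    /* embedded, not a pointer */
};

/* The callback sees only the port but can reach the enclosing owner. */
static int owner_ctor(struct port *p)
{
	struct owner *o = container_of(p, struct owner, port);

	return o->id;
}
```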

Re: [RFC PATCH] kvm: calculate correct gfn for small host pages which emulates large guest pages

2010-05-07 Thread Lai Jiangshan
Lai Jiangshan wrote:
> Lai Jiangshan wrote:
>> RFC, because maybe I missing something with the old code.
>>
>> From: Lai Jiangshan
>>
>> In Documentation/kvm/mmu.txt:
>>   gfn:
>> Either the guest page table containing the translations shadowed by this
>> page, or the base page frame for linear translations. See role.direct.
>>
>> But in function FNAME(fetch)(), sp->gfn is incorrect when one of the following
>> situations occurs:
>>  1) guest is 32bit paging and guest uses pse-36 and the guest PDE maps
>> a 4-MByte page (backed by 4k host pages) and bits 20:13 of the guest PDE
>> are not equal to 0.
>>  2) guest is long mode paging and the guest PDPTE maps a 1-GByte page
>> (backed by 4k or 2M host pages)
>>
> 
> Resend this patch with the changelog changed.
> 
> As Marcelo Tosatti and Gui Jianfeng point out,
> FNAME(fetch)() misses the quadrant on 4mb large page emulation with shadow.
> 
> Subject: [PATCH] kvm: calculate correct gfn for small host pages which 
> emulates large guest pages
> 
> In Documentation/kvm/mmu.txt:
>   gfn:
> Either the guest page table containing the translations shadowed by this
> page, or the base page frame for linear translations. See role.direct.
> 
> But in function FNAME(fetch)(), sp->gfn is incorrect when one of the following
> situations occurs:
>  1) guest is 32bit paging and the guest PDE maps a 4-MByte page
> (backed by 4k host pages): FNAME(fetch)() does not handle the quadrant.
> 
> And if the guest uses pse-36, "table_gfn = gpte_to_gfn(gw->ptes[level - delta]);"
> is incorrect.
>  2) guest is long mode paging and the guest PDPTE maps a 1-GByte page
> (backed by 4k or 2M host pages).
> 
> So we fix it to match the documentation and the code, which
> requires sp->gfn to be correct when sp->role.direct=1.
> 
> We use the goal mapping gfn (gw->gfn) to calculate the base page frame
> for linear translations; it is simple and easy to understand.
> 
> Signed-off-by: Lai Jiangshan 
> ---

Could you add these:

Reported-by: Marcelo Tosatti 
Reported-by: Gui Jianfeng 

Thanks.
Lai.

PS.
The whole patch series includes:
[PATCH] kvm mmu: reduce 50% memory usage
[PATCH] kvm: calculate correct gfn for small host pages which emulates large guest pages
[PATCH] kvm, tdp: calculate correct base gfn for non-DIR level
[PATCH] kvm: update document of gfns


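For a direct shadow page, the `gfn` field is the first guest frame of the linear range it maps, and the fix derives it from the goal mapping `gw->gfn`. Assuming 4 KiB pages and 9 index bits per level (as on x86-64), a direct shadow page at role level L covers 512^L guest frames, so its base is the goal gfn with the low 9*L bits cleared. A userspace sketch of that arithmetic; `frames_covered()` and `direct_base_gfn()` are hypothetical helpers, not the code in the patch:

```c
#include <assert.h>
#include <stdint.h>

#define LEVEL_BITS 9 /* assumption: 512 entries per page table */

/* Guest frames covered by a direct shadow page at the given role level
 * (level 1 = last-level page table). */
static uint64_t frames_covered(int level)
{
	assert(level >= 1 && level <= 4);
	return 1ULL << (LEVEL_BITS * level);
}

/* Base gfn of the direct shadow page that maps goal_gfn. */
static uint64_t direct_base_gfn(uint64_t goal_gfn, int level)
{
	return goal_gfn & ~(frames_covered(level) - 1);
}
```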

[PATCH] kvm: update document of gfns

2010-05-07 Thread Lai Jiangshan

We don't use ->gfns when role.direct=1, so update the documentation accordingly.

Signed-off-by: Lai Jiangshan 
Requested-by: Avi Kivity 
---
diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
index da04671..aec19ae 100644
--- a/Documentation/kvm/mmu.txt
+++ b/Documentation/kvm/mmu.txt
@@ -178,7 +178,9 @@ Shadow pages contain the following information:
 guest pages as leaves.
   gfns:
 An array of 512 guest frame numbers, one for each present pte.  Used to
-perform a reverse map from a pte to a gfn.
+perform a reverse map from a pte to a gfn. When role.direct is set, any
+element of this array can be calculated from the gfn field when used; in
+this case, the array of gfns is not allocated. See role.direct and gfn.
   slot_bitmap:
 A bitmap containing one bit per memory slot.  If the page contains a pte
 mapping a page from memory slot n, then bit n of slot_bitmap will be set


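The documentation change relies on every entry's gfn in a direct shadow page being implied by its index. Assuming 9 index bits per level, entry `index` of a direct shadow page at role level L maps `gfn + (index << (9 * (L - 1)))`; for example, the 4th spte of a level-2 page lies 4*512 frames past the base. A userspace sketch; `direct_entry_gfn()` is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

#define LEVEL_BITS 9 /* assumption: 512 entries per page table */

/* gfn mapped by entry `index` of a direct shadow page whose base is base_gfn. */
static uint64_t direct_entry_gfn(uint64_t base_gfn, int level, unsigned int index)
{
	assert(level >= 1 && index < 512);
	return base_gfn + ((uint64_t)index << (LEVEL_BITS * (level - 1)));
}
```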

Re: [PATCH] kvm mmu: reduce 50% memory usage

2010-05-07 Thread Lai Jiangshan
Marcelo Tosatti wrote:
> On Thu, May 06, 2010 at 03:03:48PM +0800, Lai Jiangshan wrote:
>> Marcelo Tosatti wrote:
>>> On Thu, Apr 29, 2010 at 09:43:40PM +0300, Avi Kivity wrote:
 On 04/29/2010 09:09 PM, Marcelo Tosatti wrote:
> You missed quadrant on 4mb large page emulation with shadow (see updated
> patch below).
 Good catch.

> Also for some reason i can't understand the assumption
> does not hold for large sptes with TDP, so reverted for now.
 It's unrelated to TDP, same issue with shadow.  I think the
 calculation is correct.  For example the 4th spte for a level=2 page
 will yield gfn=4*512.
>>> Under testing i see sp at level 2, with sp->gfn == 4096, mmu_set_spte
>>> setting index 8 to gfn 4096 (whereas kvm_mmu_page_get_gfn returns 4096 +
>>> 8*512).
>>>
>>> Lai, can you please take a look at it? You should see the
>>> kvm_mmu_page_set_gfn BUG_ON by using -mem-path on hugetlbfs.
>>>
>> Could you tell me how you test it? It will be better if I follow
>> your test steps.
> 
> mount -t hugetlbfs none /mnt/
> echo xyz > /proc/sys/vm/nr_hugepages
> qemu-kvm parameters -mem-path /mnt/
> 
>> I also hit the kvm_mmu_page_set_gfn BUG_ON. It is because
>> FNAME(fetch)() sets sp->gfn wrong. The patch:
>> [PATCH] kvm: calculate correct gfn for small host pages which emulates large guest pages
>> fixes it.
>>
>> I can no longer hit the kvm_mmu_page_set_gfn BUG_ON once this patch is also
>> applied.
>>
>> So could you tell me your test steps:
>> The host: ept/npt enabled? 64bit? testing codes in host?
> 
> Intel EPT enabled.
> 
>> The guest: OS? PAE? 32bit? 64bit? testing codes in guest?
> 
> FC12 guest.
> 


I forgot to check if the calculation of base gfn is correct
in __direct_map().

Subject: [PATCH] kvm, tdp: calculate correct base gfn for non-DIR level

The base gfn calculation in __direct_map() is incorrect
when level=3 or 4.

Signed-off-by: Lai Jiangshan 
Reported-by: Marcelo Tosatti 
---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a5c6719..6986a6f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1955,7 +1956,10 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
}
 
if (*iterator.sptep == shadow_trap_nonpresent_pte) {
-   pseudo_gfn = (iterator.addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
+   u64 base_addr = iterator.addr;
+
+   base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
+   pseudo_gfn = base_addr >> PAGE_SHIFT;
sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
  iterator.level - 1,
  1, ACC_ALL, iterator.sptep);




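The fix replaces the fixed 2 MiB mask `PT64_DIR_BASE_ADDR_MASK` with `PT64_LVL_ADDR_MASK(iterator.level)`. A userspace sketch comparing the two; the macros are re-declared for illustration under the stated assumptions (4 KiB pages, 9 bits per level, 52-bit physical addresses) and should be checked against `arch/x86/kvm/mmu.c`:

```c
#include <assert.h>
#include <stdint.h>

/* Re-declared for illustration; verify against the kernel tree. */
#define SK_PAGE_SHIFT 12
#define SK_LEVEL_BITS 9
#define SK_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~((1ULL << SK_PAGE_SHIFT) - 1))
#define SK_DIR_BASE_ADDR_MASK \
	(SK_BASE_ADDR_MASK & ~((1ULL << (SK_PAGE_SHIFT + SK_LEVEL_BITS)) - 1))
#define SK_LVL_ADDR_MASK(level) \
	(SK_BASE_ADDR_MASK & \
	 ~((1ULL << (SK_PAGE_SHIFT + ((level) - 1) * SK_LEVEL_BITS)) - 1))

/* Old computation: always aligns to 2 MiB, whatever the level. */
static uint64_t old_pseudo_gfn(uint64_t addr)
{
	return (addr & SK_DIR_BASE_ADDR_MASK) >> SK_PAGE_SHIFT;
}

/* Fixed computation: align to the region one entry at `level` covers. */
static uint64_t new_pseudo_gfn(uint64_t addr, int level)
{
	assert(level >= 1 && level <= 4);
	return (addr & SK_LVL_ADDR_MASK(level)) >> SK_PAGE_SHIFT;
}
```

The two agree at level 2, which is why the old code only went wrong for level 3 and 4.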

Re: Another SIGFPE in display code, now in cirrus

2010-05-07 Thread Michael Tokarev

07.05.2010 00:07, Michael Tokarev wrote:

There was a bug recently fixed in the vnc code.  Apparently
there's something similar in the cirrus emulation as well.
Here it triggers _always_ (including old versions of kvm)
when running Windows NT and hitting the "test" button in its
display resolution dialog. Here's what gdb has to say:

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0xf76cab70 (LWP 580)]
0x080c5e45 in cirrus_do_copy (s=0x86134dc, dst=96, src=0, w=2, h=9)
at hw/cirrus_vga.c:687
687 sx = (src % ABS(s->cirrus_blt_srcpitch)) / depth;
(gdb) p s->cirrus_blt_srcpitch
$2 = 0

[]

This is qemu-kvm-0.12.3 (actually a Debian package of it),
but no patches relevant to video are applied.


I just tried current qemu-kvm/master, it crashes at exactly
the same place:

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0xf79dfb70 (LWP 10840)]
0x0821b4ca in cirrus_do_copy (s=0x85dc7ac)
at hw/cirrus_vga.c:687
687 sx = (src % ABS(s->cirrus_blt_srcpitch)) / depth;

(gdb) bt
#0  0x0821b4ca in cirrus_do_copy (s=0x85dc7ac) at hw/cirrus_vga.c:687
#1  cirrus_bitblt_videotovideo_copy (s=0x85dc7ac) at hw/cirrus_vga.c:748
#2  cirrus_bitblt_videotovideo (s=0x85dc7ac) at hw/cirrus_vga.c:870
#3  cirrus_bitblt_start (s=0x85dc7ac) at hw/cirrus_vga.c:1011
#4  0x0821d009 in cirrus_vga_mem_writel (opaque=0x85dc7ac, addr=98320,
val=96) at hw/cirrus_vga.c:2120
#5  0x0811d147 in cpu_physical_memory_rw (addr=753680, buf=0xf7fdc390 "",
len=4, is_write=1) at exec.c:3475
#6  0x0807b462 in cpu_physical_memory_write () at cpu-common.h:67
#7  kvm_flush_coalesced_mmio_buffer () at kvm-all.c:808
#8  0x0807cc2e in kvm_run (env=0x84c4650) at qemu-kvm.c:575
#9  0x0807d10c in kvm_cpu_exec (env=0x84c4650) at qemu-kvm.c:1192
#10 0x0807ec0a in kvm_main_loop_cpu (_env=0x84c4650) at qemu-kvm.c:1449
#11 ap_main_loop (_env=0x84c4650) at qemu-kvm.c:1495
#12 0xf7fad3d0 in start_thread () from /lib/libpthread.so.0
#13 0xf7cbf10e in clone () from /lib/libc.so.6


Can anything be done about it?

Thanks!


/mjt

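The SIGFPE comes from the integer `%` by `ABS(s->cirrus_blt_srcpitch)` when the guest has programmed a source pitch of 0. A standalone sketch of the kind of guard that avoids the trap, using a hypothetical helper `blt_src_x()` that returns -1 for a zero pitch or depth; it is not the actual qemu fix:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: x coordinate of src within a scanline of the given pitch.
 * Returns -1 instead of trapping when pitch or depth is zero. */
static int blt_src_x(int src, int pitch, int depth)
{
	int apitch = abs(pitch);

	if (apitch == 0 || depth == 0)
		return -1; /* caller should skip the blit */
	return (src % apitch) / depth;
}
```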

Re: Endless loop in qcow2_alloc_cluster_offset

2010-05-07 Thread Kevin Wolf
On 07.05.2010 03:19, Marcelo Tosatti wrote:
> On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
>> Hi,
>>
>> I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
>> endless loop in qcow2_alloc_cluster_offset, namely over
>> QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
>>
>> (gdb) bt
>> #0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, 
>> offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at 
>> /data/qemu-kvm/block/qcow2-cluster.c:750
>> #1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at 
>> /data/qemu-kvm/block/qcow2.c:587
>> #2  0x00482a44 in qcow_aio_writev (bs=, 
>> sector_num=, qiov=, 
>> nb_sectors=, cb=, opaque=> optimized out>) at /data/qemu-kvm/block/qcow2.c:645
>> #3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, 
>> qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 , 
>> opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
>> #4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, 
>> buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
>> #5  0x00435581 in ide_sector_write (s=0xc92650) at 
>> /data/qemu-kvm/hw/ide/core.c:622
>> #6  0x00425fc2 in kvm_handle_io (env=) at 
>> /data/qemu-kvm/kvm-all.c:553
>> #7  kvm_run (env=) at /data/qemu-kvm/qemu-kvm.c:964
>> #8  0x00426049 in kvm_cpu_exec (env=0x1000) at 
>> /data/qemu-kvm/qemu-kvm.c:1651
>> #9  0x0042627d in kvm_main_loop_cpu (_env=) at 
>> /data/qemu-kvm/qemu-kvm.c:1893
>> #10 ap_main_loop (_env=) at 
>> /data/qemu-kvm/qemu-kvm.c:1943
>> #11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
>> #12 0x7f48abf0711d in clone () from /lib64/libc.so.6
>> #13 0x in ?? ()
>> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>> $5 = (struct QCowL2Meta *) 0xcb3568
>> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, 
>> depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight 
>> = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
>>
>> So next == first.
>>
> 
> Seen the exact same bug twice in a row while installing FC12 with IDE
> disk, current qemu-kvm.git. 
> 
> qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
> -m 1000  -vnc :1 \
> -net nic,model=virtio \
> -net tap,script=/root/ifup.sh -serial stdio \
> -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso \
> -monitor telnet::4445,server,nowait -usbdevice tablet
> 
> Can't reproduce though.

In current git master? That's interesting news. I had kind of expected
it would be fixed with c644db3d.

Kevin
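The dump shows a `QCowL2Meta` whose `next_in_flight.le_next` points back to itself, so `QLIST_FOREACH` never reaches NULL. When chasing this kind of corruption in a debug build, Floyd's two-pointer check can detect a cycle without extra memory. A generic sketch over a hypothetical `node` type (not qemu's `QLIST` macros):

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/* Floyd's tortoise-and-hare: returns 1 if following next never ends. */
static int list_has_cycle(const struct node *head)
{
	const struct node *slow = head, *fast = head;

	while (fast && fast->next) {
		slow = slow->next;
		fast = fast->next->next;
		if (slow == fast)
			return 1;
	}
	return 0;
}
```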