Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Rusty Russell
On Wed, 3 Jun 2009 09:15:32 am Herbert Xu wrote:
> On Tue, Jun 02, 2009 at 11:25:57PM +0930, Rusty Russell wrote:
> > Or, we could just "return NETDEV_TX_BUSY;".  I like that :)
>
> No you should fix it so that you check the queue status after
> transmitting a packet so we never get into this state in the
> first place.

We could figure out if we can take the worst-case packet, and underutilize
our queue.  And fix the other *67* drivers.

Of course that doesn't even work, because we return NETDEV_TX_BUSY from dev.c!

"Hi, core netdevs here.  Don't use NETDEV_TX_BUSY.   Yeah, we can't figure out
how to avoid it either.  But y'know, just hack something together".

Herbert, we are *better* than this!

How's this?  Tested for the virtio_net driver here.


[RFC] net: fix double-tcpdump problem with NETDEV_TX_BUSY.

Herbert shares a distain for drivers returning TX_BUSY because
network taps see packets twice when it's used.  Unfortunately, it's
ubiquitous.

This patch marks packets by (ab)using the "peeked" bit in the skb.
This bit is currently used for packets queued in a socket; we reset it
in dev_queue_xmit and set it when we hand the packet to
dev_queue_xmit_nit.

We also reset it on incoming packets: this is safe, but it might be
sufficient to reset it only in the loopback driver?

diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1678,8 +1678,10 @@ int dev_hard_start_xmit(struct sk_buff *
int rc;
 
if (likely(!skb->next)) {
-   if (!list_empty(&ptype_all))
+   if (!list_empty(&ptype_all) && !skb->peeked) {
dev_queue_xmit_nit(skb, dev);
+   skb->peeked = true;
+   }
 
if (netif_needs_gso(dev, skb)) {
if (unlikely(dev_gso_segment(skb)))
@@ -1796,6 +1798,8 @@ int dev_queue_xmit(struct sk_buff *skb)
struct Qdisc *q;
int rc = -ENOMEM;
 
+   skb->peeked = false;
+
/* GSO will handle the following emulations directly. */
if (netif_needs_gso(dev, skb))
goto gso;
@@ -1942,6 +1946,8 @@ int netif_rx(struct sk_buff *skb)
if (!skb->tstamp.tv64)
net_timestamp(skb);
 
+   skb->peeked = false;
+
/*
 * The code is rearranged so that the path is the most
 * short when CPU is congested, but is still operating.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit

2009-06-02 Thread David Miller
From: Rusty Russell 
Date: Tue, 2 Jun 2009 23:38:29 +0930

> On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
>> From: Patrick Ohly 
>> Date: Mon, 01 Jun 2009 21:47:22 +0200
>>
>> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>> >
>> > Would it be possible to make the new skb_orphan() at the start of
>> > dev_hard_start_xmit() conditionally so that it is not executed for
>> > packets that are to be time stamped?
>> >
>> > As discussed before
>> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
>> > socket pointer is required for sending back the send time stamp from
>> > inside the device driver. Calling skb_orphan() unconditionally as in
>> > this patch would break the hardware time stamping of outgoing packets.
>>
>> Indeed, we need to check that case, at a minimum.
>>
>> And there are other potentially other problems.  For example, I
>> wonder how this interacts with the new TX MMAP af_packet support
>> in net-next-2.6 :-/
> 
> I think I'll do this in the driver for now, and let's revisit doing it 
> generically later?

That might be the best course of action for the time being.
This whole area is a rat's nest.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Herbert Xu
On Tue, Jun 02, 2009 at 11:25:57PM +0930, Rusty Russell wrote:
>
> Or, we could just "return NETDEV_TX_BUSY;".  I like that :)

No you should fix it so that you check the queue status after
transmitting a packet so we never get into this state in the
first place.  NETDEV_TX_BUSY is just passing the problem to
someone else, which is not nice at all.

For example, anyone running tcpdump will now see the packet
twice.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: find_vqs operation starting at arbitrary index

2009-06-02 Thread Michael S. Tsirkin
On Tue, Jun 02, 2009 at 10:39:16PM +0530, Amit Shah wrote:
> On (Tue) Jun 02 2009 [19:32:27], Michael S. Tsirkin wrote:
> > On Tue, Jun 02, 2009 at 08:53:07AM +0930, Rusty Russell wrote:
> > > On Mon, 1 Jun 2009 05:33:48 pm Amit Shah wrote:
> > > > Hello,
> > > >
> > > > The recent find_vqs operation doesn't allow for a vq to be found at an
> > > > arbitrary location; it's meant to be called once at startup to find all
> > > > possible queues and never called again.
> > > >
> > > > This doesn't work for devices which can have queues hot-plugged at
> > > > run-time. This can be made to work by passing the 'start_index' value as
> > > > was done earlier for find_vq, but I doubt something like the following
> > > > will work. The MSI vectors might need some changing as well.
> > > 
> > > There's a fundamental conflict here: find_vqs was added so the PCI MSI 
> > > code 
> > > knows exactly how many virtqueues there are.
> > > 
> > > So you'll need to sort this out with Michael...
> > > 
> > > Thanks,
> > > Rusty.
> > 
> > I'd like to understand the expected usage some more.  If there's still a
> > small # of hotpluggable queues known in advance, you can request all
> > vectors but have the queues disabled, and then enable as queues arrive.
> > In that case, something like resize_queue or enable_queue might be a
> > good idea.
> 
> Yes, it sounds like a good idea and I was debating this exact question
> -- how many queues should I expect. I currently don't expect more than
> 20 but even that sounds like a largish number.
> 
> > OTOH, if you want to have a potentially unlimited number of queues,
> > you will have to allocate a common vector for all of them
> > as on intel we are limited to 256 total vectors for all devices
> > taken together.  Not sure what a good API for that would be.
> 
> A common vector for all the queues in the device, right? And will that
> also mean each queue gets woken up on any activity on any one of them?
> 
>   Amit

Yes. This is what find_vqs currently does if it can't allocate a queue
per VQ (actually, per queue with callback: your queues do have
callbacks?).

Of course we can start contemplating more complicated options
for mapping queues to vectors, and the guest/host ABI
supports this.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: find_vqs operation starting at arbitrary index

2009-06-02 Thread Amit Shah
On (Tue) Jun 02 2009 [19:32:27], Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 08:53:07AM +0930, Rusty Russell wrote:
> > On Mon, 1 Jun 2009 05:33:48 pm Amit Shah wrote:
> > > Hello,
> > >
> > > The recent find_vqs operation doesn't allow for a vq to be found at an
> > > arbitrary location; it's meant to be called once at startup to find all
> > > possible queues and never called again.
> > >
> > > This doesn't work for devices which can have queues hot-plugged at
> > > run-time. This can be made to work by passing the 'start_index' value as
> > > was done earlier for find_vq, but I doubt something like the following
> > > will work. The MSI vectors might need some changing as well.
> > 
> > There's a fundamental conflict here: find_vqs was added so the PCI MSI code 
> > knows exactly how many virtqueues there are.
> > 
> > So you'll need to sort this out with Michael...
> > 
> > Thanks,
> > Rusty.
> 
> I'd like to understand the expected usage some more.  If there's still a
> small # of hotpluggable queues known in advance, you can request all
> vectors but have the queues disabled, and then enable as queues arrive.
> In that case, something like resize_queue or enable_queue might be a
> good idea.

Yes, it sounds like a good idea and I was debating this exact question
-- how many queues should I expect. I currently don't expect more than
20 but even that sounds like a largish number.

> OTOH, if you want to have a potentially unlimited number of queues,
> you will have to allocate a common vector for all of them
> as on intel we are limited to 256 total vectors for all devices
> taken together.  Not sure what a good API for that would be.

A common vector for all the queues in the device, right? And will that
also mean each queue gets woken up on any activity on any one of them?

Amit
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: find_vqs operation starting at arbitrary index

2009-06-02 Thread Michael S. Tsirkin
On Tue, Jun 02, 2009 at 08:53:07AM +0930, Rusty Russell wrote:
> On Mon, 1 Jun 2009 05:33:48 pm Amit Shah wrote:
> > Hello,
> >
> > The recent find_vqs operation doesn't allow for a vq to be found at an
> > arbitrary location; it's meant to be called once at startup to find all
> > possible queues and never called again.
> >
> > This doesn't work for devices which can have queues hot-plugged at
> > run-time. This can be made to work by passing the 'start_index' value as
> > was done earlier for find_vq, but I doubt something like the following
> > will work. The MSI vectors might need some changing as well.
> 
> There's a fundamental conflict here: find_vqs was added so the PCI MSI code 
> knows exactly how many virtqueues there are.
> 
> So you'll need to sort this out with Michael...
> 
> Thanks,
> Rusty.

I'd like to understand the expected usage some more.  If there's still a
small # of hotpluggable queues known in advance, you can request all
vectors but have the queues disabled, and then enable as queues arrive.
In that case, something like resize_queue or enable_queue might be a
good idea.

OTOH, if you want to have a potentially unlimited number of queues,
you will have to allocate a common vector for all of them
as on intel we are limited to 256 total vectors for all devices
taken together.  Not sure what a good API for that would be.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 12/13] qemu: virtio save/load bindings

2009-06-02 Thread Michael S. Tsirkin
Implement bindings for virtio save/load. Use them in virtio pci.

Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio-pci.c |   49 -
 hw/virtio.c |   31 ++-
 hw/virtio.h |4 
 3 files changed, 66 insertions(+), 18 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 294f4c7..589fbb1 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -105,6 +105,48 @@ static void virtio_pci_notify(void *opaque, uint16_t 
vector)
 qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
+static void virtio_pci_save_config(void * opaque, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+pci_device_save(&proxy->pci_dev, f);
+msix_save(&proxy->pci_dev, f);
+if (msix_present(&proxy->pci_dev))
+qemu_put_be16(f, proxy->vdev->config_vector);
+}
+
+static void virtio_pci_save_queue(void * opaque, int n, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+if (msix_present(&proxy->pci_dev))
+qemu_put_be16(f, virtio_queue_vector(proxy->vdev, n));
+}
+
+static int virtio_pci_load_config(void * opaque, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+int ret;
+ret = pci_device_load(&proxy->pci_dev, f);
+if (ret)
+return ret;
+ret = msix_load(&proxy->pci_dev, f);
+if (ret)
+return ret;
+if (msix_present(&proxy->pci_dev))
+qemu_get_be16s(f, &proxy->vdev->config_vector);
+return 0;
+}
+
+static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+uint16_t vector;
+if (!msix_present(&proxy->pci_dev))
+return 0;
+qemu_get_be16s(f, &vector);
+virtio_queue_set_vector(proxy->vdev, n, vector);
+return 0;
+}
+
 static void virtio_pci_reset(void *opaque)
 {
 VirtIOPCIProxy *proxy = opaque;
@@ -317,7 +359,12 @@ static void virtio_write_config(PCIDevice *pci_dev, 
uint32_t address,
 }
 
 static const VirtIOBindings virtio_pci_bindings = {
-.notify = virtio_pci_notify
+.notify = virtio_pci_notify,
+.save_config = virtio_pci_save_config,
+.load_config = virtio_pci_load_config,
+.save_config = virtio_pci_save_config,
+.save_queue = virtio_pci_save_queue,
+.load_queue = virtio_pci_load_queue,
 };
 
 static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
diff --git a/hw/virtio.c b/hw/virtio.c
index 63ffcff..b773dff 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -568,9 +568,8 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 {
 int i;
 
-/* FIXME: load/save binding.  */
-//pci_device_save(&vdev->pci_dev, f);
-//msix_save(&vdev->pci_dev, f);
+if (vdev->binding->save_config)
+vdev->binding->save_config(vdev->binding_opaque, f);
 
 qemu_put_8s(f, &vdev->status);
 qemu_put_8s(f, &vdev->isr);
@@ -596,19 +595,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, vdev->vq[i].vring.num);
 qemu_put_be64(f, vdev->vq[i].pa);
 qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
-if (vdev->nvectors)
-qemu_put_be16s(f, &vdev->vq[i].vector);
+if (vdev->binding->save_queue)
+vdev->binding->save_queue(vdev->binding_opaque, i, f);
 }
 }
 
 int virtio_load(VirtIODevice *vdev, QEMUFile *f)
 {
-int num, i;
+int num, i, ret;
 
-/* FIXME: load/save binding.  */
-//pci_device_load(&vdev->pci_dev, f);
-//r = msix_load(&vdev->pci_dev, f);
-//pci_resize_io_region(&vdev->pci_dev, 1, msix_bar_size(&vdev->pci_dev));
+if (vdev->binding->load_config) {
+ret = vdev->binding->load_config(vdev->binding_opaque, f);
+if (ret)
+return ret;
+}
 
 qemu_get_8s(f, &vdev->status);
 qemu_get_8s(f, &vdev->isr);
@@ -617,10 +617,6 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
 vdev->config_len = qemu_get_be32(f);
 qemu_get_buffer(f, vdev->config, vdev->config_len);
 
-if (vdev->nvectors) {
-qemu_get_be16s(f, &vdev->config_vector);
-//msix_vector_use(&vdev->pci_dev, vdev->config_vector);
-}
 num = qemu_get_be32(f);
 
 for (i = 0; i < num; i++) {
@@ -631,9 +627,10 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
 if (vdev->vq[i].pa) {
 virtqueue_init(&vdev->vq[i]);
 }
-if (vdev->nvectors) {
-qemu_get_be16s(f, &vdev->vq[i].vector);
-//msix_vector_use(&vdev->pci_dev, vdev->config_vector);
+if (vdev->binding->load_queue) {
+ret = vdev->binding->load_queue(vdev->binding_opaque, i, f);
+if (ret)
+return ret;
 }
 }
 
diff --git a/hw/virtio.h b/hw/virtio.h
index 04a3c3d..ce05517 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -72,6 +72,10 @@ typedef struct VirtQueueElement
 
 typedef struct {
 void (*notify)(void * opaque, uint16_t vector);
+void (*save_config)(void * opaque, QEMUFile *f);
+void (*save_queue)(void

[PATCHv2 10/13] qemu: MSI-X support in virtio PCI

2009-06-02 Thread Michael S. Tsirkin
This enables actual support for MSI-X in virtio PCI.
First user will be virtio-net.

Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio-pci.c |  152 --
 1 files changed, 112 insertions(+), 40 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 7dfdd80..294f4c7 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -18,6 +18,7 @@
 #include "virtio.h"
 #include "pci.h"
 //#include "sysemu.h"
+#include "msix.h"
 
 /* from Linux's linux/virtio_pci.h */
 
@@ -47,7 +48,24 @@
  * a read-and-acknowledge. */
 #define VIRTIO_PCI_ISR  19
 
-#define VIRTIO_PCI_CONFIG   20
+/* MSI-X registers: only enabled if MSI-X is enabled. */
+/* A 16-bit vector for configuration changes. */
+#define VIRTIO_MSI_CONFIG_VECTOR20
+/* A 16-bit vector for selected queue notifications. */
+#define VIRTIO_MSI_QUEUE_VECTOR 22
+
+/* Config space size */
+#define VIRTIO_PCI_CONFIG_NOMSI 20
+#define VIRTIO_PCI_CONFIG_MSI   24
+#define VIRTIO_PCI_REGION_SIZE(dev) (msix_present(dev) ? \
+ VIRTIO_PCI_CONFIG_MSI : \
+ VIRTIO_PCI_CONFIG_NOMSI)
+
+/* The remaining space is defined by each driver as the per-driver
+ * configuration space */
+#define VIRTIO_PCI_CONFIG(dev)  (msix_enabled(dev) ? \
+ VIRTIO_PCI_CONFIG_MSI : \
+ VIRTIO_PCI_CONFIG_NOMSI)
 
 /* Virtio ABI version, if we increment this, we break the guest driver. */
 #define VIRTIO_PCI_ABI_VERSION  0
@@ -81,14 +99,17 @@ typedef struct {
 static void virtio_pci_notify(void *opaque, uint16_t vector)
 {
 VirtIOPCIProxy *proxy = opaque;
-
-qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
+if (msix_enabled(&proxy->pci_dev))
+msix_notify(&proxy->pci_dev, vector);
+else
+qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
 static void virtio_pci_reset(void *opaque)
 {
 VirtIOPCIProxy *proxy = opaque;
 virtio_reset(proxy->vdev);
+msix_reset(&proxy->pci_dev);
 }
 
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
@@ -97,8 +118,6 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, 
uint32_t val)
 VirtIODevice *vdev = proxy->vdev;
 target_phys_addr_t pa;
 
-addr -= proxy->addr;
-
 switch (addr) {
 case VIRTIO_PCI_GUEST_FEATURES:
/* Guest does not negotiate properly?  We have to assume nothing. */
@@ -131,17 +150,33 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 if (vdev->status == 0)
 virtio_pci_reset(proxy);
 break;
+case VIRTIO_MSI_CONFIG_VECTOR:
+msix_vector_unuse(&proxy->pci_dev, vdev->config_vector);
+/* Make it possible for guest to discover an error took place. */
+if (msix_vector_use(&proxy->pci_dev, val) < 0)
+val = VIRTIO_NO_VECTOR;
+vdev->config_vector = val;
+break;
+case VIRTIO_MSI_QUEUE_VECTOR:
+msix_vector_unuse(&proxy->pci_dev,
+  virtio_queue_vector(vdev, vdev->queue_sel));
+/* Make it possible for guest to discover an error took place. */
+if (msix_vector_use(&proxy->pci_dev, val) < 0)
+val = VIRTIO_NO_VECTOR;
+virtio_queue_set_vector(vdev, vdev->queue_sel, val);
+break;
+default:
+fprintf(stderr, "%s: unexpected address 0x%x value 0x%x\n",
+__func__, addr, val);
+break;
 }
 }
 
-static uint32_t virtio_ioport_read(void *opaque, uint32_t addr)
+static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr)
 {
-VirtIOPCIProxy *proxy = opaque;
 VirtIODevice *vdev = proxy->vdev;
 uint32_t ret = 0x;
 
-addr -= proxy->addr;
-
 switch (addr) {
 case VIRTIO_PCI_HOST_FEATURES:
 ret = vdev->get_features(vdev);
@@ -169,6 +204,12 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t 
addr)
 vdev->isr = 0;
 qemu_set_irq(proxy->pci_dev.irq[0], 0);
 break;
+case VIRTIO_MSI_CONFIG_VECTOR:
+ret = vdev->config_vector;
+break;
+case VIRTIO_MSI_QUEUE_VECTOR:
+ret = virtio_queue_vector(vdev, vdev->queue_sel);
+break;
 default:
 break;
 }
@@ -179,42 +220,72 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t 
addr)
 static uint32_t virtio_pci_config_readb(void *opaque, uint32_t addr)
 {
 VirtIOPCIProxy *proxy = opaque;
-addr -= proxy->addr + VIRTIO_PCI_CONFIG;
+uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
+addr -= proxy->addr;
+if (addr < config)
+return virtio_ioport_read(proxy, addr);
+addr -= config;
 return virtio_config_readb(proxy->vdev, addr);
 }
 
 static uint32_t virtio_pci_config_readw(void *opaque, uint32_t addr)
 {
 VirtIOPCIProxy *proxy = 

[PATCHv2 05/13] qemu: MSI-X support functions

2009-06-02 Thread Michael S. Tsirkin
Add functions implementing MSI-X support. First user will be virtio-pci.
Note that platform must set a flag to declare MSI supported.
For PC this will be set by APIC.

Signed-off-by: Michael S. Tsirkin 
---
 Makefile.target |2 +-
 hw/msix.c   |  423 +++
 hw/msix.h   |   35 +
 hw/pci.h|   20 +++
 4 files changed, 479 insertions(+), 1 deletions(-)
 create mode 100644 hw/msix.c
 create mode 100644 hw/msix.h

diff --git a/Makefile.target b/Makefile.target
index 664a1e3..87b2859 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -486,7 +486,7 @@ endif #CONFIG_BSD_USER
 ifndef CONFIG_USER_ONLY
 
 OBJS=vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
- gdbstub.o gdbstub-xml.o
+ gdbstub.o gdbstub-xml.o msix.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
 OBJS+=virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o
diff --git a/hw/msix.c b/hw/msix.c
new file mode 100644
index 000..4db2fc1
--- /dev/null
+++ b/hw/msix.c
@@ -0,0 +1,423 @@
+/*
+ * MSI-X device support
+ *
+ * This module includes support for MSI-X in pci devices.
+ *
+ * Author: Michael S. Tsirkin 
+ *
+ *  Copyright (c) 2009, Red Hat Inc, Michael S. Tsirkin (m...@redhat.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "hw.h"
+#include "msix.h"
+#include "pci.h"
+
+/* Declaration from linux/pci_regs.h */
+#define  PCI_CAP_ID_MSIX 0x11 /* MSI-X */
+#define  PCI_MSIX_FLAGS 2 /* Table at lower 11 bits */
+#define  PCI_MSIX_FLAGS_QSIZE  0x7FF
+#define  PCI_MSIX_FLAGS_ENABLE (1 << 15)
+#define  PCI_MSIX_FLAGS_BIRMASK(7 << 0)
+
+/* MSI-X capability structure */
+#define MSIX_TABLE_OFFSET 4
+#define MSIX_PBA_OFFSET 8
+#define MSIX_CAP_LENGTH 12
+
+/* MSI enable bit is in byte 1 in FLAGS register */
+#define MSIX_ENABLE_OFFSET (PCI_MSIX_FLAGS + 1)
+#define MSIX_ENABLE_MASK (PCI_MSIX_FLAGS_ENABLE >> 8)
+
+/* MSI-X table format */
+#define MSIX_MSG_ADDR 0
+#define MSIX_MSG_UPPER_ADDR 4
+#define MSIX_MSG_DATA 8
+#define MSIX_VECTOR_CTRL 12
+#define MSIX_ENTRY_SIZE 16
+#define MSIX_VECTOR_MASK 0x1
+
+/* How much space does an MSIX table need. */
+/* The spec requires giving the table structure
+ * a 4K aligned region all by itself. Align it to
+ * target pages so that drivers can do passthrough
+ * on the rest of the region. */
+#define MSIX_PAGE_SIZE TARGET_PAGE_ALIGN(0x1000)
+/* Reserve second half of the page for pending bits */
+#define MSIX_PAGE_PENDING (MSIX_PAGE_SIZE / 2)
+#define MSIX_MAX_ENTRIES 32
+
+
+#ifdef MSIX_DEBUG
+#define DEBUG(fmt, ...)   \
+do {  \
+  fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__);\
+} while (0)
+#else
+#define DEBUG(fmt, ...) do { } while(0)
+#endif
+
+/* Flag to globally disable MSI-X support */
+int msix_disable;
+
+/* Flag for interrupt controller to declare MSI-X support */
+int msix_supported;
+
+/* Add MSI-X capability to the config space for the device. */
+/* Given a bar and its size, add MSI-X table on top of it
+ * and fill MSI-X capability in the config space.
+ * Original bar size must be a power of 2 or 0.
+ * New bar size is returned. */
+static int msix_add_config(struct PCIDevice *pdev, unsigned short nentries,
+   unsigned bar_nr, unsigned bar_size)
+{
+int config_offset;
+uint8_t *config;
+uint32_t new_size;
+
+if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1)
+return -EINVAL;
+if (bar_size > 0x8000)
+return -ENOSPC;
+
+/* Add space for MSI-X structures */
+if (!bar_size)
+new_size = MSIX_PAGE_SIZE;
+else if (bar_size < MSIX_PAGE_SIZE) {
+bar_size = MSIX_PAGE_SIZE;
+new_size = MSIX_PAGE_SIZE * 2;
+} else
+new_size = bar_size * 2;
+
+pdev->msix_bar_size = new_size;
+config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, MSIX_CAP_LENGTH);
+if (config_offset < 0)
+return config_offset;
+config = pdev->config + config_offset;
+
+pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
+/* Table on top of BAR */
+pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
+/* Pending bits on top of that */
+pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) |
+ bar_nr);
+pdev->msix_cap = config_offset;
+/* Make flags bit writeable. */
+pdev->mask[config_offset + MSIX_ENABLE_OFFSET] |= MSIX_ENABLE_MASK;
+return 0;
+}
+
+static void msix_free_irq_entries(PCIDevice *dev)
+{
+int vector;
+
+for (vector = 0; vector < dev->msix_entries_nr; ++vector)
+dev->msix_entry_used[vector] = 0;
+}
+
+/* Handle MSI-X capability config write. */
+void msix_write_config(PCIDevice *dev, uint32_t addr,
+   ui

[PATCHv2 13/13] qemu: add pci_get/set_byte

2009-06-02 Thread Michael S. Tsirkin
Add pci_get/set_byte to keep *_word and *_long access functions company.
They are unused for now.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.h |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/hw/pci.h b/hw/pci.h
index 4072f16..e1e4fb4 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -263,6 +263,18 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t 
vid, uint16_t did,
 pci_map_irq_fn map_irq, const char *name);
 
 static inline void
+pci_set_byte(uint8_t *config, uint8_t val)
+{
+*config = val;
+}
+
+static inline uint8_t
+pci_get_byte(uint8_t *config)
+{
+return *config;
+}
+
+static inline void
 pci_set_word(uint8_t *config, uint16_t val)
 {
 cpu_to_le16wu((uint16_t *)config, val);
-- 
1.6.3.1.56.g79e1.dirty
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 09/13] qemu: virtio support for many interrupt vectors

2009-06-02 Thread Michael S. Tsirkin
Extend virtio to support many interrupt vectors, and rearrange code in
preparation for multi-vector support (mostly move reset out to bindings,
because we will have to reset the vectors in transport-specific code).
Actual bindings in pci, and use in net, to follow.
Load and save are not connected to bindings yet, so they are left
stubbed out for now.

Signed-off-by: Michael S. Tsirkin 
---
 hw/syborg_virtio.c |   13 --
 hw/virtio-pci.c|   24 +++
 hw/virtio.c|   63 ++-
 hw/virtio.h|   10 ++-
 4 files changed, 84 insertions(+), 26 deletions(-)

diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
index 37c219c..d8c978a 100644
--- a/hw/syborg_virtio.c
+++ b/hw/syborg_virtio.c
@@ -134,7 +134,10 @@ static void syborg_virtio_writel(void *opaque, 
target_phys_addr_t offset,
 vdev->features = value;
 break;
 case SYBORG_VIRTIO_QUEUE_BASE:
-virtio_queue_set_addr(vdev, vdev->queue_sel, value);
+if (value == 0)
+virtio_reset(vdev);
+else
+virtio_queue_set_addr(vdev, vdev->queue_sel, value);
 break;
 case SYBORG_VIRTIO_QUEUE_SEL:
 if (value < VIRTIO_PCI_QUEUE_MAX)
@@ -228,7 +231,7 @@ static CPUWriteMemoryFunc *syborg_virtio_writefn[] = {
  syborg_virtio_writel
 };
 
-static void syborg_virtio_update_irq(void *opaque)
+static void syborg_virtio_update_irq(void *opaque, uint16_t vector)
 {
 SyborgVirtIOProxy *proxy = opaque;
 int level;
@@ -239,7 +242,7 @@ static void syborg_virtio_update_irq(void *opaque)
 }
 
 static VirtIOBindings syborg_virtio_bindings = {
-.update_irq = syborg_virtio_update_irq
+.notify = syborg_virtio_update_irq
 };
 
 static void syborg_virtio_init(SyborgVirtIOProxy *proxy, VirtIODevice *vdev)
@@ -248,6 +251,8 @@ static void syborg_virtio_init(SyborgVirtIOProxy *proxy, 
VirtIODevice *vdev)
 
 proxy->vdev = vdev;
 
+/* Don't support multiple vectors */
+proxy->vdev->nvectors = 0;
 sysbus_init_irq(&proxy->busdev, &proxy->irq);
 iomemtype = cpu_register_io_memory(0, syborg_virtio_readfn,
syborg_virtio_writefn, proxy);
@@ -255,6 +260,8 @@ static void syborg_virtio_init(SyborgVirtIOProxy *proxy, 
VirtIODevice *vdev)
 
 proxy->id = ((uint32_t)0x1af4 << 16) | vdev->device_id;
 
+qemu_register_reset(virtio_reset, 0, vdev);
+
 virtio_bind_device(vdev, &syborg_virtio_bindings, proxy);
 }
 
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index c072423..7dfdd80 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -78,13 +78,19 @@ typedef struct {
 
 /* virtio device */
 
-static void virtio_pci_update_irq(void *opaque)
+static void virtio_pci_notify(void *opaque, uint16_t vector)
 {
 VirtIOPCIProxy *proxy = opaque;
 
 qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
+static void virtio_pci_reset(void *opaque)
+{
+VirtIOPCIProxy *proxy = opaque;
+virtio_reset(proxy->vdev);
+}
+
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
 VirtIOPCIProxy *proxy = opaque;
@@ -108,7 +114,10 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 break;
 case VIRTIO_PCI_QUEUE_PFN:
 pa = (target_phys_addr_t)val << VIRTIO_PCI_QUEUE_ADDR_SHIFT;
-virtio_queue_set_addr(vdev, vdev->queue_sel, pa);
+if (pa == 0)
+virtio_pci_reset(proxy);
+else
+virtio_queue_set_addr(vdev, vdev->queue_sel, pa);
 break;
 case VIRTIO_PCI_QUEUE_SEL:
 if (val < VIRTIO_PCI_QUEUE_MAX)
@@ -120,7 +129,7 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 case VIRTIO_PCI_STATUS:
 vdev->status = val & 0xFF;
 if (vdev->status == 0)
-virtio_reset(vdev);
+virtio_pci_reset(proxy);
 break;
 }
 }
@@ -158,7 +167,7 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t 
addr)
 /* reading from the ISR also clears it. */
 ret = vdev->isr;
 vdev->isr = 0;
-virtio_update_irq(vdev);
+qemu_set_irq(proxy->pci_dev.irq[0], 0);
 break;
 default:
 break;
@@ -243,7 +252,7 @@ static void virtio_map(PCIDevice *pci_dev, int region_num,
 }
 
 static const VirtIOBindings virtio_pci_bindings = {
-.update_irq = virtio_pci_update_irq
+.notify = virtio_pci_notify
 };
 
 static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
@@ -255,6 +264,9 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, 
VirtIODevice *vdev,
 
 proxy->vdev = vdev;
 
+/* No support for multiple vectors yet. */
+proxy->vdev->nvectors = 0;
+
 config = proxy->pci_dev.config;
 pci_config_set_vendor_id(config, vendor);
 pci_config_set_device_id(config, device);
@@ -279,6 +291,8 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, 
VirtIODevice *vdev,
 pci_register_io_region(&pr

[PATCHv2 07/13] qemu: minimal MSI/MSI-X implementation for PC

2009-06-02 Thread Michael S. Tsirkin
Implement MSI support in APIC. Note that MSI and MMIO APIC registers
are at the same memory location, but actually not on the global bus: MSI
is on PCI bus, APIC is connected directly to the CPU. We map them on the
global bus at the same address which happens to work because MSI
registers are reserved in APIC MMIO and vice versa.

Signed-off-by: Michael S. Tsirkin 
---
 hw/apic.c |   50 ++
 1 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 8c8b2de..d831709 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -19,6 +19,8 @@
  */
 #include "hw.h"
 #include "pc.h"
+#include "pci.h"
+#include "msix.h"
 #include "qemu-timer.h"
 #include "host-utils.h"
 
@@ -63,6 +65,19 @@
 #define MAX_APICS 255
 #define MAX_APIC_WORDS 8
 
+/* Intel APIC constants: from include/asm/msidef.h */
+#define MSI_DATA_VECTOR_SHIFT  0
+#define MSI_DATA_VECTOR_MASK   0x00ff
+#define MSI_DATA_DELIVERY_MODE_SHIFT   8
+#define MSI_DATA_TRIGGER_SHIFT 15
+#define MSI_DATA_LEVEL_SHIFT   14
+#define MSI_ADDR_DEST_MODE_SHIFT   2
+#define MSI_ADDR_DEST_ID_SHIFT 12
+#defineMSI_ADDR_DEST_ID_MASK   0x000
+
+#define MSI_ADDR_BASE   0xfee0
+#define MSI_ADDR_SIZE   0x10
+
 typedef struct APICState {
 CPUState *cpu_env;
 uint32_t apicbase;
@@ -86,6 +101,13 @@ typedef struct APICState {
 QEMUTimer *timer;
 } APICState;
 
+struct msi_state {
+uint64_t addr;
+uint32_t data;
+int mask;
+int pending;
+};
+
 static int apic_io_memory;
 static APICState *local_apics[MAX_APICS + 1];
 static int last_apic_id = 0;
@@ -712,11 +734,31 @@ static uint32_t apic_mem_readl(void *opaque, 
target_phys_addr_t addr)
 return val;
 }
 
+static void apic_send_msi(target_phys_addr_t addr, uint32 data)
+{
+uint8_t dest = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
+uint8_t vector = (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
+uint8_t dest_mode = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
+uint8_t trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+uint8_t delivery = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
+/* XXX: Ignore redirection hint. */
+apic_deliver_irq(dest, dest_mode, delivery, vector, 0, trigger_mode);
+}
+
 static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t 
val)
 {
 CPUState *env;
 APICState *s;
-int index;
+int index = (addr >> 4) & 0xff;
+if (addr > 0xfff || !index) {
+/* MSI and MMIO APIC are at the same memory location,
+ * but actually not on the global bus: MSI is on PCI bus
+ * APIC is connected directly to the CPU.
+ * Mapping them on the global bus happens to work because
+ * MSI registers are reserved in APIC MMIO and vice versa. */
+apic_send_msi(addr, val);
+return;
+}
 
 env = cpu_single_env;
 if (!env)
@@ -727,7 +769,6 @@ static void apic_mem_writel(void *opaque, 
target_phys_addr_t addr, uint32_t val)
 printf("APIC write: %08x = %08x\n", (uint32_t)addr, val);
 #endif
 
-index = (addr >> 4) & 0xff;
 switch(index) {
 case 0x02:
 s->id = (val >> 24);
@@ -911,6 +952,7 @@ int apic_init(CPUState *env)
 s->cpu_env = env;
 
 apic_reset(s);
+msix_supported = 1;
 
 /* XXX: mapping more APICs at the same memory location */
 if (apic_io_memory == 0) {
@@ -918,7 +960,8 @@ int apic_init(CPUState *env)
on the global memory bus. */
 apic_io_memory = cpu_register_io_memory(0, apic_mem_read,
 apic_mem_write, NULL);
-cpu_register_physical_memory(s->apicbase & ~0xfff, 0x1000,
+/* XXX: what if the base changes? */
+cpu_register_physical_memory(MSI_ADDR_BASE, MSI_ADDR_SIZE,
  apic_io_memory);
 }
 s->timer = qemu_new_timer(vm_clock, apic_timer, s);
@@ -929,4 +972,3 @@ int apic_init(CPUState *env)
 local_apics[s->id] = s;
 return 0;
 }
-
-- 
1.6.3.1.56.g79e1.dirty

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit

2009-06-02 Thread Patrick Ohly
On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditionally so that it is not executed for
packets that are to be time stamped?

As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.

This should work without much overhead (skb_tx() expands to a lookup of
the skb_shared_info):
if (!skb_tx(skb)->flags)
skb_orphan(skb);

Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 08/13] qemu: add support for resizing regions

2009-06-02 Thread Michael S. Tsirkin
Make it possible to resize PCI regions.  This will be used by virtio
with MSI-X, where the region size depends on whether MSI-X is enabled,
and can change across load/save.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |   54 --
 hw/pci.h |3 +++
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 31ba2ed..a63d988 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -392,6 +392,41 @@ void pci_register_io_region(PCIDevice *pci_dev, int 
region_num,
 *(uint32_t *)(pci_dev->mask + addr) = cpu_to_le32(mask);
 }
 
+static void pci_unmap_region(PCIDevice *d, PCIIORegion *r)
+{
+if (r->addr == -1)
+return;
+if (r->type & PCI_ADDRESS_SPACE_IO) {
+int class;
+/* NOTE: specific hack for IDE in PC case:
+   only one byte must be mapped. */
+class = pci_get_word(d->config + PCI_CLASS_DEVICE);
+if (class == 0x0101 && r->size == 4) {
+isa_unassign_ioport(r->addr + 2, 1);
+} else {
+isa_unassign_ioport(r->addr, r->size);
+}
+} else {
+cpu_register_physical_memory(pci_to_cpu_addr(r->addr),
+ r->size,
+ IO_MEM_UNASSIGNED);
+qemu_unregister_coalesced_mmio(r->addr, r->size);
+}
+}
+
+void pci_resize_io_region(PCIDevice *pci_dev, int region_num,
+  uint32_t size)
+{
+
+PCIIORegion *r = &pci_dev->io_regions[region_num];
+if (r->size == size)
+return;
+r->size = size;
+pci_unmap_region(pci_dev, r);
+r->addr = -1;
+pci_update_mappings(pci_dev);
+}
+
 static void pci_update_mappings(PCIDevice *d)
 {
 PCIIORegion *r;
@@ -445,24 +480,7 @@ static void pci_update_mappings(PCIDevice *d)
 }
 /* now do the real mapping */
 if (new_addr != r->addr) {
-if (r->addr != -1) {
-if (r->type & PCI_ADDRESS_SPACE_IO) {
-int class;
-/* NOTE: specific hack for IDE in PC case:
-   only one byte must be mapped. */
-class = d->config[0x0a] | (d->config[0x0b] << 8);
-if (class == 0x0101 && r->size == 4) {
-isa_unassign_ioport(r->addr + 2, 1);
-} else {
-isa_unassign_ioport(r->addr, r->size);
-}
-} else {
-cpu_register_physical_memory(pci_to_cpu_addr(r->addr),
- r->size,
- IO_MEM_UNASSIGNED);
-qemu_unregister_coalesced_mmio(r->addr, r->size);
-}
-}
+pci_unmap_region(d, r);
 r->addr = new_addr;
 if (r->addr != -1) {
 r->map_func(d, i, r->addr, r->size, r->type);
diff --git a/hw/pci.h b/hw/pci.h
index 6da626b..4072f16 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -221,6 +221,9 @@ void pci_register_io_region(PCIDevice *pci_dev, int 
region_num,
 uint32_t size, int type,
 PCIMapIORegionFunc *map_func);
 
+void pci_resize_io_region(PCIDevice *pci_dev, int region_num,
+  uint32_t size);
+
 int pci_add_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
 
 void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
-- 
1.6.3.1.56.g79e1.dirty

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 11/13] qemu: request 3 vectors in virtio-net

2009-06-02 Thread Michael S. Tsirkin
Request up to 3 vectors in virtio-net. Actual bindings might supply
less.

Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio-net.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 60aa6da..6118fe3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -621,6 +621,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev)
 n->mac_table.macs = qemu_mallocz(MAC_TABLE_ENTRIES * ETH_ALEN);
 
 n->vlans = qemu_mallocz(MAX_VLAN >> 3);
+n->vdev.nvectors = 3;
 
 register_savevm("virtio-net", virtio_net_id++, VIRTIO_NET_VM_VERSION,
 virtio_net_save, virtio_net_load, n);
-- 
1.6.3.1.56.g79e1.dirty

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 04/13] qemu: helper routines for pci access.

2009-06-02 Thread Michael S. Tsirkin
Add inline routines for convenient access to pci devices
with correct (little) endianness. Will be used by MSI-X support.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.h |   30 +++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/hw/pci.h b/hw/pci.h
index 40137c6..4e112a3 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -240,21 +240,45 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t 
vid, uint16_t did,
 pci_map_irq_fn map_irq, const char *name);
 
 static inline void
+pci_set_word(uint8_t *config, uint16_t val)
+{
+cpu_to_le16wu((uint16_t *)config, val);
+}
+
+static inline uint16_t
+pci_get_word(uint8_t *config)
+{
+return le16_to_cpupu((uint16_t *)config);
+}
+
+static inline void
+pci_set_long(uint8_t *config, uint32_t val)
+{
+cpu_to_le32wu((uint32_t *)config, val);
+}
+
+static inline uint32_t
+pci_get_long(uint8_t *config)
+{
+return le32_to_cpupu((uint32_t *)config);
+}
+
+static inline void
 pci_config_set_vendor_id(uint8_t *pci_config, uint16_t val)
 {
-cpu_to_le16wu((uint16_t *)&pci_config[PCI_VENDOR_ID], val);
+pci_set_word(&pci_config[PCI_VENDOR_ID], val);
 }
 
 static inline void
 pci_config_set_device_id(uint8_t *pci_config, uint16_t val)
 {
-cpu_to_le16wu((uint16_t *)&pci_config[PCI_DEVICE_ID], val);
+pci_set_word(&pci_config[PCI_DEVICE_ID], val);
 }
 
 static inline void
 pci_config_set_class(uint8_t *pci_config, uint16_t val)
 {
-cpu_to_le16wu((uint16_t *)&pci_config[PCI_CLASS_DEVICE], val);
+pci_set_word(&pci_config[PCI_CLASS_DEVICE], val);
 }
 
 typedef void (*pci_qdev_initfn)(PCIDevice *dev);
-- 
1.6.3.1.56.g79e1.dirty

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 06/13] qemu: add flag to disable MSI-X by default

2009-06-02 Thread Michael S. Tsirkin
Add global flag to disable MSI-X by default.  This is useful primarily
to make images loadable by older qemu (without msix).  Even when MSI-X
is disabled by flag, you can still load images that have MSI-X enabled.

Signed-off-by: Michael S. Tsirkin 
---
 hw/msix.c   |3 +++
 qemu-options.hx |2 ++
 vl.c|3 +++
 3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 4db2fc1..970a8ca 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -122,6 +122,9 @@ void msix_write_config(PCIDevice *dev, uint32_t addr,
uint32_t val, int len)
 {
 unsigned enable_pos = dev->msix_cap + MSIX_ENABLE_OFFSET;
+if (!(dev->cap_present & QEMU_PCI_CAP_MSIX))
+return;
+
 if (addr + len <= enable_pos || addr > enable_pos)
 return;
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 87af798..fd041a4 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1575,3 +1575,5 @@ DEF("semihosting", 0, QEMU_OPTION_semihosting,
 DEF("old-param", 0, QEMU_OPTION_old_param,
 "-old-param  old param mode\n")
 #endif
+DEF("disable-msix", 0, QEMU_OPTION_disable_msix,
+"-disable-msix disable msix support for PCI devices (enabled by 
default)\n")
diff --git a/vl.c b/vl.c
index 2c1f0e0..2757d4f 100644
--- a/vl.c
+++ b/vl.c
@@ -134,6 +134,7 @@ int main(int argc, char **argv)
 #include "hw/usb.h"
 #include "hw/pcmcia.h"
 #include "hw/pc.h"
+#include "hw/msix.h"
 #include "hw/audiodev.h"
 #include "hw/isa.h"
 #include "hw/baum.h"
@@ -5557,6 +5558,8 @@ int main(int argc, char **argv, char **envp)
 xen_mode = XEN_ATTACH;
 break;
 #endif
+case QEMU_OPTION_disable_msix:
+msix_disable = 1;
 }
 }
 }
-- 
1.6.3.1.56.g79e1.dirty

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 03/13] qemu: add routines to manage PCI capabilities

2009-06-02 Thread Michael S. Tsirkin
Add routines to manage PCI capability list. First user will be MSI-X.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |   98 --
 hw/pci.h |   18 +++-
 2 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 5dcfb4e..31ba2ed 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -130,12 +130,13 @@ void pci_device_save(PCIDevice *s, QEMUFile *f)
 int version = s->cap_present ? 3 : 2;
 int i;
 
-qemu_put_be32(f, version); /* PCI device version */
+/* PCI device version and capabilities */
+qemu_put_be32(f, version);
+if (version >= 3)
+qemu_put_be32(f, s->cap_present);
 qemu_put_buffer(f, s->config, 256);
 for (i = 0; i < 4; i++)
 qemu_put_be32(f, s->irq_state[i]);
-if (version >= 3)
-qemu_put_be32(f, s->cap_present);
 }
 
 int pci_device_load(PCIDevice *s, QEMUFile *f)
@@ -146,12 +147,6 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 version_id = qemu_get_be32(f);
 if (version_id > 3)
 return -EINVAL;
-qemu_get_buffer(f, s->config, 256);
-pci_update_mappings(s);
-
-if (version_id >= 2)
-for (i = 0; i < 4; i ++)
-s->irq_state[i] = qemu_get_be32(f);
 if (version_id >= 3)
 s->cap_present = qemu_get_be32(f);
 else
@@ -160,6 +155,18 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 if (s->cap_present & ~s->cap_supported)
 return -EINVAL;
 
+qemu_get_buffer(f, s->config, 256);
+pci_update_mappings(s);
+
+if (version_id >= 2)
+for (i = 0; i < 4; i ++)
+s->irq_state[i] = qemu_get_be32(f);
+/* Clear mask and used bits for capabilities.
+   Must be restored separately, since capabilities can
+   be placed anywhere in config space. */
+memset(s->used, 0, PCI_CONFIG_SPACE_SIZE);
+for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i)
+s->mask[i] = 0xff;
 return 0;
 }
 
@@ -870,3 +877,76 @@ PCIDevice *pci_create_simple(PCIBus *bus, int devfn, const 
char *name)
 
 return (PCIDevice *)dev;
 }
+
+static int pci_find_space(PCIDevice *pdev, uint8_t size)
+{
+int offset = PCI_CONFIG_HEADER_SIZE;
+int i;
+for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i)
+if (pdev->used[i])
+offset = i + 1;
+else if (i - offset + 1 == size)
+return offset;
+return 0;
+}
+
+static uint8_t pci_find_capability_list(PCIDevice *pdev, uint8_t cap_id,
+uint8_t *prev_p)
+{
+uint8_t next, prev;
+
+if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST))
+return 0;
+
+for (prev = PCI_CAPABILITY_LIST; (next = pdev->config[prev]);
+ prev = next + PCI_CAP_LIST_NEXT)
+if (pdev->config[next + PCI_CAP_LIST_ID] == cap_id)
+break;
+
+*prev_p = prev;
+return next;
+}
+
+/* Reserve space and add capability to the linked list in pci config space */
+int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
+{
+uint8_t offset = pci_find_space(pdev, size);
+uint8_t *config = pdev->config + offset;
+if (!offset)
+return -ENOSPC;
+config[PCI_CAP_LIST_ID] = cap_id;
+config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
+pdev->config[PCI_CAPABILITY_LIST] = offset;
+pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
+memset(pdev->used + offset, 0xFF, size);
+/* Make capability read-only by default */
+memset(pdev->mask + offset, 0, size);
+return offset;
+}
+
+/* Unlink capability from the pci config space. */
+void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
+{
+uint8_t prev, offset = pci_find_capability_list(pdev, cap_id, &prev);
+if (!offset)
+return;
+pdev->config[prev] = pdev->config[offset + PCI_CAP_LIST_NEXT];
+/* Make capability writeable again */
+memset(pdev->mask + offset, 0xff, size);
+memset(pdev->used + offset, 0, size);
+
+if (!pdev->config[PCI_CAPABILITY_LIST])
+pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
+}
+
+/* Reserve space for capability at a known offset (to call after load). */
+void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size)
+{
+memset(pdev->used + offset, 0xff, size);
+}
+
+uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id)
+{
+uint8_t prev;
+return pci_find_capability_list(pdev, cap_id, &prev);
+}
diff --git a/hw/pci.h b/hw/pci.h
index 9139504..40137c6 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -123,6 +123,10 @@ typedef struct PCIIORegion {
 #define PCI_MIN_GNT0x3e/* 8 bits */
 #define PCI_MAX_LAT0x3f/* 8 bits */
 
+/* Capability lists */
+#define PCI_CAP_LIST_ID0   /* Capability ID */
+#define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
+
 #define PCI_REVISION0x08/* obsolete, use PCI_REVISION_ID */
 #define

[PATCHv2 02/13] qemu: capability bits in pci save/restore

2009-06-02 Thread Michael S. Tsirkin
Add support for capability bits in save/restore for pci.
These will be used for MSI, where the capability might
be present or not as requested by user, which does not
map well into a single version number.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |   14 --
 hw/pci.h |4 
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 1e70e46..5dcfb4e 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -127,12 +127,15 @@ int pci_bus_num(PCIBus *s)
 
 void pci_device_save(PCIDevice *s, QEMUFile *f)
 {
+int version = s->cap_present ? 3 : 2;
 int i;
 
-qemu_put_be32(f, 2); /* PCI device version */
+qemu_put_be32(f, version); /* PCI device version */
 qemu_put_buffer(f, s->config, 256);
 for (i = 0; i < 4; i++)
 qemu_put_be32(f, s->irq_state[i]);
+if (version >= 3)
+qemu_put_be32(f, s->cap_present);
 }
 
 int pci_device_load(PCIDevice *s, QEMUFile *f)
@@ -141,7 +144,7 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 int i;
 
 version_id = qemu_get_be32(f);
-if (version_id > 2)
+if (version_id > 3)
 return -EINVAL;
 qemu_get_buffer(f, s->config, 256);
 pci_update_mappings(s);
@@ -149,6 +152,13 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 if (version_id >= 2)
 for (i = 0; i < 4; i ++)
 s->irq_state[i] = qemu_get_be32(f);
+if (version_id >= 3)
+s->cap_present = qemu_get_be32(f);
+else
+s->cap_present = 0;
+
+if (s->cap_present & ~s->cap_supported)
+return -EINVAL;
 
 return 0;
 }
diff --git a/hw/pci.h b/hw/pci.h
index c0c2380..9139504 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -178,6 +178,10 @@ struct PCIDevice {
 
 /* Current IRQ levels.  Used internally by the generic PCI code.  */
 int irq_state[4];
+
+/* Capability bits for save/load */
+uint32_t cap_supported;
+uint32_t cap_present;
 };
 
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
-- 
1.6.3.1.56.g79e1.dirty

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv2 01/13] qemu: make default_write_config use mask table

2009-06-02 Thread Michael S. Tsirkin
Change much of hw/pci to use symbolic constants and a table-driven
design: add a mask table with writable bits set and readonly bits unset.
Detect change by comparing original and new registers.

This makes it easy to support capabilities where read-only/writeable
bit layout differs between devices, depending on capabilities present.

As a result, writing a single byte in BAR registers now works as
it should. Writing to upper limit registers in the bridge
also works as it should. Code is also shorter.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |  145 -
 hw/pci.h |   18 +++-
 2 files changed, 46 insertions(+), 117 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 0ab5b94..1e70e46 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -239,6 +239,17 @@ int pci_assign_devaddr(const char *addr, int *domp, int 
*busp, unsigned *slotp)
 return pci_parse_devaddr(devaddr, domp, busp, slotp);
 }
 
+static void pci_init_mask(PCIDevice *dev)
+{
+int i;
+dev->mask[PCI_CACHE_LINE_SIZE] = 0xff;
+dev->mask[PCI_INTERRUPT_LINE] = 0xff;
+dev->mask[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY
+ | PCI_COMMAND_MASTER;
+for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i)
+dev->mask[i] = 0xff;
+}
+
 /* -1 for devfn means auto assign */
 static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
  const char *name, int devfn,
@@ -261,6 +272,7 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev, PCIBus *bus,
 pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
 memset(pci_dev->irq_state, 0, sizeof(pci_dev->irq_state));
 pci_set_default_subsystem_id(pci_dev);
+pci_init_mask(pci_dev);
 
 if (!config_read)
 config_read = pci_default_read_config;
@@ -334,6 +346,7 @@ void pci_register_io_region(PCIDevice *pci_dev, int 
region_num,
 {
 PCIIORegion *r;
 uint32_t addr;
+uint32_t mask;
 
 if ((unsigned int)region_num >= PCI_NUM_REGIONS)
 return;
@@ -349,12 +362,17 @@ void pci_register_io_region(PCIDevice *pci_dev, int 
region_num,
 r->size = size;
 r->type = type;
 r->map_func = map_func;
+
+mask = ~(size - 1);
 if (region_num == PCI_ROM_SLOT) {
 addr = 0x30;
+/* ROM enable bit is writeable */
+mask |= 1;
 } else {
 addr = 0x10 + region_num * 4;
 }
 *(uint32_t *)(pci_dev->config + addr) = cpu_to_le32(type);
+*(uint32_t *)(pci_dev->mask + addr) = cpu_to_le32(mask);
 }
 
 static void pci_update_mappings(PCIDevice *d)
@@ -463,118 +481,21 @@ uint32_t pci_default_read_config(PCIDevice *d,
 return val;
 }
 
-void pci_default_write_config(PCIDevice *d,
-  uint32_t address, uint32_t val, int len)
+void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
 {
-int can_write, i;
-uint32_t end, addr;
-
-if (len == 4 && ((address >= 0x10 && address < 0x10 + 4 * 6) ||
- (address >= 0x30 && address < 0x34))) {
-PCIIORegion *r;
-int reg;
+uint8_t orig[PCI_CONFIG_SPACE_SIZE];
+int i;
 
-if ( address >= 0x30 ) {
-reg = PCI_ROM_SLOT;
-}else{
-reg = (address - 0x10) >> 2;
-}
-r = &d->io_regions[reg];
-if (r->size == 0)
-goto default_config;
-/* compute the stored value */
-if (reg == PCI_ROM_SLOT) {
-/* keep ROM enable bit */
-val &= (~(r->size - 1)) | 1;
-} else {
-val &= ~(r->size - 1);
-val |= r->type;
-}
-*(uint32_t *)(d->config + address) = cpu_to_le32(val);
-pci_update_mappings(d);
-return;
-}
- default_config:
 /* not efficient, but simple */
-addr = address;
-for(i = 0; i < len; i++) {
-/* default read/write accesses */
-switch(d->config[0x0e]) {
-case 0x00:
-case 0x80:
-switch(addr) {
-case 0x00:
-case 0x01:
-case 0x02:
-case 0x03:
-case 0x06:
-case 0x07:
-case 0x08:
-case 0x09:
-case 0x0a:
-case 0x0b:
-case 0x0e:
-case 0x10 ... 0x27: /* base */
-case 0x2c ... 0x2f: /* read-only subsystem ID & vendor ID */
-case 0x30 ... 0x33: /* rom */
-case 0x3d:
-can_write = 0;
-break;
-default:
-can_write = 1;
-break;
-}
-break;
-default:
-case 0x01:
-switch(addr) {
-case 0x00:
-case 0x01:
-case 0x02:
-case 0x03:
-case 0x06:
-case 0x07:
-case 0x08:
-case 0x09:
-case 0x0a:
-case 0x

[PATCHv2 00/13] qemu: MSI-X support

2009-06-02 Thread Michael S. Tsirkin
Resending. Incorporated a minor fix pointed out by Isaku Yamahata.

Here is the port of MSI-X support patches to upstream qemu.
Please comment or commit.

This patchset adds generic support for MSI-X, adds implementation in
APIC, and uses MSI-X in virtio-net. At Paul's suggestion, I use stl_phy
to decouple APIC and MSI-X implementation.

This uses the mask table patch that I posted previously, and which is
now included in the series.

-- 
MST

Michael S. Tsirkin (13):
  qemu: make default_write_config use mask table
  qemu: capability bits in pci save/restore
  qemu: add routines to manage PCI capabilities
  qemu: helper routines for pci access.
  qemu: MSI-X support functions
  qemu: add flag to disable MSI-X by default
  qemu: minimal MSI/MSI-X implementation for PC
  qemu: add support for resizing regions
  qemu: virtio support for many interrupt vectors
  qemu: MSI-X support in virtio PCI
  qemu: request 3 vectors in virtio-net
  qemu: virtio save/load bindings
  qemu: add pci_get/set_byte

 Makefile.target|2 +-
 hw/apic.c  |   50 ++-
 hw/msix.c  |  426 
 hw/msix.h  |   35 +
 hw/pci.c   |  295 +++-
 hw/pci.h   |  105 -
 hw/syborg_virtio.c |   13 ++-
 hw/virtio-net.c|1 +
 hw/virtio-pci.c|  215 +-
 hw/virtio.c|   70 ++---
 hw/virtio.h|   14 ++-
 qemu-options.hx|2 +
 vl.c   |3 +
 13 files changed, 1017 insertions(+), 214 deletions(-)
 create mode 100644 hw/msix.c
 create mode 100644 hw/msix.h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] virtio_net: don't free buffers in xmit ring

2009-06-02 Thread Rusty Russell
On Tue, 2 Jun 2009 09:11:42 pm Mark McLoughlin wrote:
> On Tue, 2009-06-02 at 09:13 +0100, Mark McLoughlin wrote:
> > I think skb_orphan() is enough to prevent this, is it?
>
> Oops, I meant:
>
>   I don't think skb_orphan() is enough to prevent this, is it?

Yes, point taken.  No, it's not.  We could add nf_reset here too, which would 
be perfectly sane.

I'm tempted to avoid the question marks hanging over generic skb_orphan for 
now, and just do skb_orphan & nf_reset at the head of our driver.

Then look at making skb_orphan do nf_reset.  It makes sense, but I'd have to 
audit all the callers.

Thanks.
Rusty.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit

2009-06-02 Thread Rusty Russell
On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly 
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditionally so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are other potentially other problems.  For example, I
> wonder how this interacts with the new TX MMAP af_packet support
> in net-next-2.6 :-/

I think I'll do this in the driver for now, and let's revisit doing it 
generically later?

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Rusty Russell
On Tue, 2 Jun 2009 05:37:30 pm Mark McLoughlin wrote:
> On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:
> > This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
> > "virtio: wean net driver off NETDEV_TX_BUSY".
> >
> > The complexity of queuing an skb (setting a tasklet to re-xmit) is
> > questionable,
>
> It certainly adds some subtle complexities to start_xmit()
>
> >  especially once we get rid of the other reason for the
> > tasklet in the next patch.
> >
> > If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.  It
> > might be frowned upon, but it's common and not going away any time
> > soon.
> >
> > Signed-off-by: Rusty Russell 
> > Cc: Herbert Xu 
> > ---
> >  drivers/net/virtio_net.c |   49
> > ++- 1 file changed, 11
> > insertions(+), 38 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> >
> > @@ -526,27 +517,14 @@ again:
> > /* Free up any pending old buffers before queueing new ones. */
> > free_old_xmit_skbs(vi);
> >
> > -   /* If we has a buffer left over from last time, send it now. */
> > -   if (unlikely(vi->last_xmit_skb) &&
> > -   xmit_skb(vi, vi->last_xmit_skb) != 0)
> > -   goto stop_queue;
> > +   /* Put new one in send queue and do transmit */
> > +   __skb_queue_head(&vi->send, skb);
> > +   if (likely(xmit_skb(vi, skb) == 0)) {
> > +   vi->svq->vq_ops->kick(vi->svq);
> > +   return NETDEV_TX_OK;
> > +   }
>
> Hmm, is it okay to leave the skb on the send queue if we return
> NETDEV_TX_BUSY?

Certainly not.  That's a bug.  Incremental fix is:

diff -u b/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- b/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -524,8 +524,9 @@
return NETDEV_TX_OK;
}
 
-   /* Ring too full for this packet. */
+   /* Ring too full for this packet, remove it from queue again. */
pr_debug("%s: virtio not prepared to send\n", dev->name);
+   __skb_unlink(skb, &vi->send);
netif_stop_queue(dev);
 
/* Activate callback for using skbs: if this returns false it

Thanks!
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Rusty Russell
On Tue, 2 Jun 2009 06:35:52 pm Herbert Xu wrote:
> On Fri, May 29, 2009 at 11:46:04PM +0930, Rusty Russell wrote:
> > This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
> > "virtio: wean net driver off NETDEV_TX_BUSY".
> >
> > The complexity of queuing an skb (setting a tasklet to re-xmit) is
> > questionable, especially once we get rid of the other reason for the
> > tasklet in the next patch.
> >
> > If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.  It
> > might be frowned upon, but it's common and not going away any time
> > soon.
>
> This looks like a step backwards to me.  If we have to do this
> to fix a bug, sure.  But just doing it for the sake of it smells
> wrong.

I disagree.  We've introduced a third queue, inside the driver, one element 
long.  That feels terribly, terribly hacky and wrong.

What do we do if it overflows?  Discard the packet, even if we have room in the 
tx queue.   And when do we flush this queue?  Well, that's a bit messy.  
Obviously we need to enable the tx interrupt when the tx queue is full, but we 
can't just netif_wake_queue, we have to also flush the queue.  We can't do that 
in an irq handler, since we need to block the normal tx path (or introduce 
another lock and disable interrupts in our xmit routine).  So we add a tasklet 
to do this re-transmission.

Or, we could just "return NETDEV_TX_BUSY;".  I like that :)

Hope that clarifies,
Rusty.



>
> Cheers,

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] virtio_net: don't free buffers in xmit ring

2009-06-02 Thread Rusty Russell
On Tue, 2 Jun 2009 05:43:42 pm Mark McLoughlin wrote:
> On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:
> > The virtio_net driver is complicated by the two methods of freeing old
> > xmit buffers (in addition to freeing old ones at the start of the xmit
> > path).
> >
> > The original code used a 1/10 second timer attached to xmit_free(),
> > reset on every xmit.  Before we orphaned skbs on xmit, the
> > transmitting userspace could block with a full socket until the timer
> > fired, the skb destructor was called, and they were re-woken.
>
> The timer was actually added to solve a hang when trying to unload
> nf_conntrack AFAIR - nf_conntrack was blocking on the skb being freed
> and we never actually freed it.
>
> I think skb_orphan() is enough to prevent this, is it?

Yep.

> > Unfortunately, VIRTIO_F_NOTIFY_ON_EMPTY makes quite a lot of work: a
> > host which is faster than the guest will fire the interrupt every xmit
> > packet (slowing the guest down further).
>
> Ouch. So, does simply disabling host support for
> VIRTIO_F_NOTIFY_ON_EMPTY speed up current guests?

That was my original change.  It cuts the interrupts, but there's enough else 
going on that there's not a reliably-measurable speedup (this is with lguest, 
haven't tested with kvm).

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] virtio_net: don't free buffers in xmit ring

2009-06-02 Thread Mark McLoughlin
On Tue, 2009-06-02 at 09:13 +0100, Mark McLoughlin wrote:
> On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:
> > The virtio_net driver is complicated by the two methods of freeing old
> > xmit buffers (in addition to freeing old ones at the start of the xmit
> > path).
> > 
> > The original code used a 1/10 second timer attached to xmit_free(),
> > reset on every xmit.  Before we orphaned skbs on xmit, the
> > transmitting userspace could block with a full socket until the timer
> > fired, the skb destructor was called, and they were re-woken.
> 
> The timer was actually added to solve a hang when trying to unload
> nf_conntrack AFAIR - nf_conntrack was blocking on the skb being freed
> and we never actually freed it.
> 
> I think skb_orphan() is enough to prevent this, is it?

Oops, I meant:

  I don't think skb_orphan() is enough to prevent this, is it?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Herbert Xu
On Fri, May 29, 2009 at 11:46:04PM +0930, Rusty Russell wrote:
> 
> This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
> "virtio: wean net driver off NETDEV_TX_BUSY".
> 
> The complexity of queuing an skb (setting a tasklet to re-xmit) is
> questionable, especially once we get rid of the other reason for the
> tasklet in the next patch.
> 
> If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.  It
> might be frowned upon, but it's common and not going away any time
> soon.

This looks like a step backwards to me.  If we have to do this
to fix a bug, sure.  But just doing it for the sake of it smells
wrong.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] virtio_net: don't free buffers in xmit ring

2009-06-02 Thread Mark McLoughlin
On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:
> The virtio_net driver is complicated by the two methods of freeing old
> xmit buffers (in addition to freeing old ones at the start of the xmit
> path).
> 
> The original code used a 1/10 second timer attached to xmit_free(),
> reset on every xmit.  Before we orphaned skbs on xmit, the
> transmitting userspace could block with a full socket until the timer
> fired, the skb destructor was called, and they were re-woken.

The timer was actually added to solve a hang when trying to unload
nf_conntrack AFAIR - nf_conntrack was blocking on the skb being freed
and we never actually freed it.

I think skb_orphan() is enough to prevent this, is it?

> So we added the VIRTIO_F_NOTIFY_ON_EMPTY feature: supporting devices
> send an interrupt (even if normally suppressed) on an empty xmit ring
> which makes us schedule xmit_tasklet().  This was a benchmark win.
> 
> Unfortunately, VIRTIO_F_NOTIFY_ON_EMPTY makes quite a lot of work: a
> host which is faster than the guest will fire the interrupt every xmit
> packet (slowing the guest down further).

Ouch. So, does simply disabling host support for
VIRTIO_F_NOTIFY_ON_EMPTY speed up current guests?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Mark McLoughlin
On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:

> This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
> "virtio: wean net driver off NETDEV_TX_BUSY".
> 
> The complexity of queuing an skb (setting a tasklet to re-xmit) is
> questionable,

It certainly adds some subtle complexities to start_xmit() 

>  especially once we get rid of the other reason for the
> tasklet in the next patch.
> 
> If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.  It
> might be frowned upon, but it's common and not going away any time
> soon.
> 
> Signed-off-by: Rusty Russell 
> Cc: Herbert Xu 
> ---
>  drivers/net/virtio_net.c |   49 
> ++-
>  1 file changed, 11 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
>  
> @@ -526,27 +517,14 @@ again:
>   /* Free up any pending old buffers before queueing new ones. */
>   free_old_xmit_skbs(vi);
>  
> - /* If we has a buffer left over from last time, send it now. */
> - if (unlikely(vi->last_xmit_skb) &&
> - xmit_skb(vi, vi->last_xmit_skb) != 0)
> - goto stop_queue;
> + /* Put new one in send queue and do transmit */
> + __skb_queue_head(&vi->send, skb);
> + if (likely(xmit_skb(vi, skb) == 0)) {
> + vi->svq->vq_ops->kick(vi->svq);
> + return NETDEV_TX_OK;
> + }

Hmm, is it okay to leave the skb on the send queue if we return
NETDEV_TX_BUSY?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit

2009-06-02 Thread David Miller
From: Patrick Ohly 
Date: Mon, 01 Jun 2009 21:47:22 +0200

> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditionally so that it is not executed for
> packets that are to be time stamped?
> 
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.

Indeed, we need to check that case, at a minimum.

And there are other potentially other problems.  For example, I
wonder how this interacts with the new TX MMAP af_packet support
in net-next-2.6 :-/

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization