Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, Nov 25, 2009 at 08:50:21PM +1030, Rusty Russell wrote:
> On Wed, 25 Nov 2009 07:45:30 pm Michael S. Tsirkin wrote:
> > Hmm, is it really worth it to save a header copy if it's linear? We are
> > going to access it anyway, and it fits into one cacheline nicely. On
> > the other hand we have more code making life harder for compiler and
> > processor.
>
> Not sure: I think there would be many places where it would be useful.
> We do a similar thing in the kernel to inspect non-linear packets, and
> it's served us well.

You mean this gives measurable speedup? Okay ...

> Cheers,
> Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, 25 Nov 2009 07:45:30 pm Michael S. Tsirkin wrote:
> Hmm, is it really worth it to save a header copy if it's linear? We are
> going to access it anyway, and it fits into one cacheline nicely. On
> the other hand we have more code making life harder for compiler and
> processor.

Not sure: I think there would be many places where it would be useful.
We do a similar thing in the kernel to inspect non-linear packets, and
it's served us well.

Cheers,
Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, Nov 25, 2009 at 10:42:06AM +1030, Rusty Russell wrote:
> On Tue, 24 Nov 2009 10:07:54 pm Michael S. Tsirkin wrote:
> > On Tue, Nov 24, 2009 at 08:54:23AM +1030, Rusty Russell wrote:
> > > On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
> > > > > > + skb = (struct sk_buff *)buf;
> > > > > This cast is unnecessary, but a comment would be nice:
> > > >
> > > > Without this cast there is a compile warning.
> > >
> > > Hi Shirley,
> > >
> > > Looks like buf is a void *, so no cast should be necessary.  But I
> > > could be reading the patch wrong.
> > >
> > > > > However, I question whether making it 16 byte is the right thing: the
> > > > > ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
> > > >
> > > > Because QEMU requires the 10-byte header in a separate buffer, one
> > > > page is shared between the virtio_net_hdr header, which is 10 bytes,
> > > > and the rest of the data. So I put a 6-byte offset here between the
> > > > two buffers. I didn't look at the reason why a separate buf is used
> > > > for virtio_net_hdr in QEMU.
> > >
> > > It's a qemu bug.  It insists the header be an element in the
> > > scatterlist by itself.  Unfortunately we have to accommodate it.
> >
> > We do? Let's just fix this?
> > All we have to do is replace memcpy with proper iovec walk, correct?
> > Something like the following (untested) patch? It's probably not too
> > late to put this in the next qemu release...
>
> You might want to implement a more generic helper which does:
>
> /* Return pointer into iovec if we can, otherwise copy into buf */
> void *pull_iovec(struct iovec *iov, int iovcnt, void *buf, size_t len)
> {
> 	unsigned int i;
> 	void *p;
>
> 	if (likely(iovcnt && iov[0].iov_len >= len)) {
> 		/* Nice contiguous chunk. */
> 		void *p = iov[0].iov_base;
> 		iov[0].iov_base += len;
> 		iov[0].iov_len -= len;
> 		return p;
> 	}
>
> 	p = buf;
> 	for (i = 0; i < iovcnt; i++) {
> 		size_t this_len = min(len, iov[i].iov_len);
> 		memcpy(p, iov[i].iov_base, this_len);
> 		p += this_len;
> 		len -= this_len;
> 		iov[i].iov_base += this_len;
> 		iov[i].iov_len -= this_len;
> 		if (len == 0)
> 			return buf;
> 	}
> 	/* BTW, we screwed your iovec. */
> 	return NULL;
> }
>
> Then use it in all the virtio drivers...

Hmm, is it really worth it to save a header copy if it's linear? We are
going to access it anyway, and it fits into one cacheline nicely. On
the other hand we have more code making life harder for compiler and
processor.

> Thanks!
> Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Wed, 25 Nov 2009 01:06:32 am Anthony Liguori wrote:
> So does lguest.  It's been that way since the beginning.  Fixing this
> would result in breaking older guests.

Agreed, we can't "fix" it in the guests, but it's a wart.  That's why I
haven't bothered fixing it, but if someone else wants to I'll cheer all
the way.

lguest did it because I knew I could fix lguest any time; it was a bad
mistake and I will now fix lguest :)

> We really need to introduce a feature bit if we want to change this.

I don't think it's worth it.  But the spec does say that the
implementation should not rely on the framing (I think there's a note
that current implementations are buggy tho, so you should frame it
separately anyway).  That way future devices can get it right, at least.

Thanks,
Rusty.
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, 24 Nov 2009 10:07:54 pm Michael S. Tsirkin wrote:
> On Tue, Nov 24, 2009 at 08:54:23AM +1030, Rusty Russell wrote:
> > On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
> > > > > + skb = (struct sk_buff *)buf;
> > > > This cast is unnecessary, but a comment would be nice:
> > >
> > > Without this cast there is a compile warning.
> >
> > Hi Shirley,
> >
> > Looks like buf is a void *, so no cast should be necessary.  But I
> > could be reading the patch wrong.
> >
> > > > However, I question whether making it 16 byte is the right thing: the
> > > > ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
> > >
> > > Because QEMU requires the 10-byte header in a separate buffer, one
> > > page is shared between the virtio_net_hdr header, which is 10 bytes,
> > > and the rest of the data. So I put a 6-byte offset here between the
> > > two buffers. I didn't look at the reason why a separate buf is used
> > > for virtio_net_hdr in QEMU.
> >
> > It's a qemu bug.  It insists the header be an element in the
> > scatterlist by itself.  Unfortunately we have to accommodate it.
>
> We do? Let's just fix this?
> All we have to do is replace memcpy with proper iovec walk, correct?
> Something like the following (untested) patch? It's probably not too
> late to put this in the next qemu release...

You might want to implement a more generic helper which does:

/* Return pointer into iovec if we can, otherwise copy into buf */
void *pull_iovec(struct iovec *iov, int iovcnt, void *buf, size_t len)
{
	unsigned int i;
	void *p;

	if (likely(iovcnt && iov[0].iov_len >= len)) {
		/* Nice contiguous chunk. */
		void *p = iov[0].iov_base;
		iov[0].iov_base += len;
		iov[0].iov_len -= len;
		return p;
	}

	p = buf;
	for (i = 0; i < iovcnt; i++) {
		size_t this_len = min(len, iov[i].iov_len);
		memcpy(p, iov[i].iov_base, this_len);
		p += this_len;
		len -= this_len;
		iov[i].iov_base += this_len;
		iov[i].iov_len -= this_len;
		if (len == 0)
			return buf;
	}
	/* BTW, we screwed your iovec. */
	return NULL;
}

Then use it in all the virtio drivers...

Thanks!
Rusty.
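[Editor's note: the helper above is explicitly untested, and as written it indexes `iov[i]` before `i` is set and adjusts the iovec by the wrong amount in the gather loop. A minimal userspace sketch of the same idea follows; `likely()` and `min()` are dropped and the casts added for portability, so the details differ from the kernel-style original.]

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Sketch of the proposed helper: return a pointer into the iovec when
 * the first element already holds `len` contiguous bytes, otherwise
 * gather into `buf`.  Either way the consumed bytes are pulled off the
 * front of the vector (i.e. the caller's iovec is modified). */
void *pull_iovec(struct iovec *iov, int iovcnt, void *buf, size_t len)
{
	char *p = buf;
	int i;

	if (iovcnt && iov[0].iov_len >= len) {
		/* Nice contiguous chunk. */
		void *ret = iov[0].iov_base;
		iov[0].iov_base = (char *)iov[0].iov_base + len;
		iov[0].iov_len -= len;
		return ret;
	}

	for (i = 0; i < iovcnt && len; i++) {
		size_t this_len = len < iov[i].iov_len ? len : iov[i].iov_len;
		memcpy(p, iov[i].iov_base, this_len);
		p += this_len;
		len -= this_len;
		iov[i].iov_base = (char *)iov[i].iov_base + this_len;
		iov[i].iov_len -= this_len;
	}
	/* NULL if the vector ran out before we got len bytes. */
	return len ? NULL : buf;
}
```

The fast path hands back a zero-copy pointer; only the caller's scratch buffer is touched when the request spans elements.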
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, Nov 24, 2009 at 08:36:32AM -0600, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> On Tue, Nov 24, 2009 at 08:54:23AM +1030, Rusty Russell wrote:
>>> On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
>>>>>> + skb = (struct sk_buff *)buf;
>>>>> This cast is unnecessary, but a comment would be nice:
>>>> Without this cast there is a compile warning.
>>>
>>> Hi Shirley,
>>>
>>> Looks like buf is a void *, so no cast should be necessary.  But I
>>> could be reading the patch wrong.
>>>
>>>>> However, I question whether making it 16 byte is the right thing: the
>>>>> ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
>>>> Because QEMU requires the 10-byte header in a separate buffer, one
>>>> page is shared between the virtio_net_hdr header, which is 10 bytes,
>>>> and the rest of the data. So I put a 6-byte offset here between the
>>>> two buffers. I didn't look at the reason why a separate buf is used
>>>> for virtio_net_hdr in QEMU.
>>>
>>> It's a qemu bug.  It insists the header be an element in the
>>> scatterlist by itself.  Unfortunately we have to accommodate it.
>>
>> We do? Let's just fix this?
>
> So does lguest.

It does? All I see it doing is writev/readv, and this passes things to
tap which handles this correctly.

> It's been that way since the beginning.  Fixing this would result in
> breaking older guests.

If you look at my patch, it handles old guests just fine :).

> We really need to introduce a feature bit if we want to change this.

I am not sure I agree: we can't add feature bits for all bugs, can we?

-- 
MST
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, Nov 24, 2009 at 08:36:32AM -0600, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> On Tue, Nov 24, 2009 at 08:54:23AM +1030, Rusty Russell wrote:
>>> On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
>>>>>> + skb = (struct sk_buff *)buf;
>>>>> This cast is unnecessary, but a comment would be nice:
>>>> Without this cast there is a compile warning.
>>>
>>> Hi Shirley,
>>>
>>> Looks like buf is a void *, so no cast should be necessary.  But I
>>> could be reading the patch wrong.
>>>
>>>>> However, I question whether making it 16 byte is the right thing: the
>>>>> ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
>>>> Because QEMU requires the 10-byte header in a separate buffer, one
>>>> page is shared between the virtio_net_hdr header, which is 10 bytes,
>>>> and the rest of the data. So I put a 6-byte offset here between the
>>>> two buffers. I didn't look at the reason why a separate buf is used
>>>> for virtio_net_hdr in QEMU.
>>>
>>> It's a qemu bug.  It insists the header be an element in the
>>> scatterlist by itself.  Unfortunately we have to accommodate it.
>>
>> We do? Let's just fix this?
>
> So does lguest.  It's been that way since the beginning.  Fixing this
> would result in breaking older guests.

The patch you are replying to fixes this in a way that does not break
older guests.

> We really need to introduce a feature bit if we want to change this.

-- 
MST
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
Michael S. Tsirkin wrote:
> On Tue, Nov 24, 2009 at 08:54:23AM +1030, Rusty Russell wrote:
>> On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
>>>>> + skb = (struct sk_buff *)buf;
>>>> This cast is unnecessary, but a comment would be nice:
>>> Without this cast there is a compile warning.
>>
>> Hi Shirley,
>>
>> Looks like buf is a void *, so no cast should be necessary.  But I
>> could be reading the patch wrong.
>>
>>>> However, I question whether making it 16 byte is the right thing: the
>>>> ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
>>> Because QEMU requires the 10-byte header in a separate buffer, one
>>> page is shared between the virtio_net_hdr header, which is 10 bytes,
>>> and the rest of the data. So I put a 6-byte offset here between the
>>> two buffers. I didn't look at the reason why a separate buf is used
>>> for virtio_net_hdr in QEMU.
>>
>> It's a qemu bug.  It insists the header be an element in the
>> scatterlist by itself.  Unfortunately we have to accommodate it.
>
> We do? Let's just fix this?

So does lguest.  It's been that way since the beginning.  Fixing this
would result in breaking older guests.

We really need to introduce a feature bit if we want to change this.

Regards,

Anthony Liguori
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, Nov 24, 2009 at 08:54:23AM +1030, Rusty Russell wrote:
> On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
> > > > + skb = (struct sk_buff *)buf;
> > > This cast is unnecessary, but a comment would be nice:
> >
> > Without this cast there is a compile warning.
>
> Hi Shirley,
>
> Looks like buf is a void *, so no cast should be necessary.  But I
> could be reading the patch wrong.
>
> > > However, I question whether making it 16 byte is the right thing: the
> > > ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
> >
> > Because QEMU requires the 10-byte header in a separate buffer, one
> > page is shared between the virtio_net_hdr header, which is 10 bytes,
> > and the rest of the data. So I put a 6-byte offset here between the
> > two buffers. I didn't look at the reason why a separate buf is used
> > for virtio_net_hdr in QEMU.
>
> It's a qemu bug.  It insists the header be an element in the
> scatterlist by itself.  Unfortunately we have to accommodate it.

We do? Let's just fix this?
All we have to do is replace memcpy with proper iovec walk, correct?
Something like the following (untested) patch? It's probably not too
late to put this in the next qemu release...

Signed-off-by: Michael S. Tsirkin

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 2f147e5..06c5148 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -434,26 +434,59 @@ static int iov_fill(struct iovec *iov, int iovcnt, const void *buf, int count)
     return offset;
 }
 
+static int iov_skip(struct iovec *iov, int iovcnt, int count)
+{
+    int offset, i;
+
+    offset = i = 0;
+    while (offset < count && i < iovcnt) {
+        int len = MIN(iov[i].iov_len, count - offset);
+        iov[i].iov_base += len;
+        iov[i].iov_len -= len;
+        offset += len;
+        i++;
+    }
+
+    return offset;
+}
+
+static int iov_copy(struct iovec *to, struct iovec *from, int iovcnt, int count)
+{
+    int offset, i;
+
+    offset = i = 0;
+    while (offset < count && i < iovcnt) {
+        int len = MIN(from[i].iov_len, count - offset);
+        to[i].iov_base = from[i].iov_base;
+        to[i].iov_len = from[i].iov_len;
+        offset += len;
+        i++;
+    }
+
+    return i;
+}
+
 static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
                           const void *buf, size_t size, size_t hdr_len)
 {
-    struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)iov[0].iov_base;
+    struct virtio_net_hdr hdr = {};
     int offset = 0;
 
-    hdr->flags = 0;
-    hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
+    hdr.flags = 0;
+    hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
     if (n->has_vnet_hdr) {
-        memcpy(hdr, buf, sizeof(*hdr));
-        offset = sizeof(*hdr);
-        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+        memcpy(&hdr, buf, sizeof hdr);
+        offset = sizeof hdr;
+        work_around_broken_dhclient(&hdr, buf + offset, size - offset);
     }
 
+    iov_fill(iov, iovcnt, &hdr, sizeof hdr);
+
     /* We only ever receive a struct virtio_net_hdr from the tapfd,
      * but we may be passing along a larger header to the guest. */
-    iov[0].iov_base += hdr_len;
-    iov[0].iov_len -= hdr_len;
+    iov_skip(iov, iovcnt, hdr_len);
 
     return offset;
 }
@@ -514,7 +547,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
 static ssize_t virtio_net_receive(VLANClientState *vc, const uint8_t *buf, size_t size)
 {
     VirtIONet *n = vc->opaque;
-    struct virtio_net_hdr_mrg_rxbuf *mhdr = NULL;
+    struct iovec mhdr[VIRTQUEUE_MAX_SIZE];
+    int mhdrcnt = 0;
     size_t hdr_len, offset, i;
 
     if (!virtio_net_can_receive(n->vc))
@@ -552,16 +586,13 @@ static ssize_t virtio_net_receive(VLANClientState *vc, const uint8_t *buf, size_
             exit(1);
         }
 
-        if (!n->mergeable_rx_bufs && elem.in_sg[0].iov_len != hdr_len) {
-            fprintf(stderr, "virtio-net header not in first element\n");
-            exit(1);
-        }
-
         memcpy(&sg, &elem.in_sg[0], sizeof(sg[0]) * elem.in_num);
 
         if (i == 0) {
-            if (n->mergeable_rx_bufs)
-                mhdr = (struct virtio_net_hdr_mrg_rxbuf *)sg[0].iov_base;
+            if (n->mergeable_rx_bufs) {
+                mhdrcnt = iov_copy(mhdr, sg, elem.in_num,
+                                   sizeof(struct virtio_net_hdr_mrg_rxbuf));
+            }
 
             offset += receive_header(n, sg, elem.in_num,
                                      buf + offset, size - offset, hdr_len);
@@ -579,8 +610,12 @@ static ssize_t virtio_net_receive(VLANClientState *vc, const uint8_t *buf, size_
         offset += len;
     }
 
-    if (mhdr)
-        mhdr->num_buffers = i;
+    if (mhdrcnt) {
+        uint16_t num = i;
+        iov_skip(mhdr, mhdrcnt,
+                 offsetof(struct virtio_net_hdr_mrg_rxbuf, num_buffers));
+        iov_fill(mhdr, mhdrcnt, &num, sizeof num);
+    }
 
     virtqueue_flush(n->rx_v
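[Editor's note: the `iov_skip` helper in the patch just consumes bytes from the front of a vector in place. A standalone userspace sketch of that behaviour follows; it uses `size_t` and explicit casts rather than the patch's `int` / GNU `void *` arithmetic, so the details differ from the QEMU code.]

```c
#include <assert.h>
#include <sys/uio.h>

/* Consume `count` bytes from the front of the vector, mutating it in
 * place.  Returns the number of bytes actually skipped, which is less
 * than `count` if the vector is shorter than that. */
static size_t iov_skip(struct iovec *iov, int iovcnt, size_t count)
{
	size_t offset = 0;
	int i;

	for (i = 0; i < iovcnt && offset < count; i++) {
		size_t len = iov[i].iov_len;

		if (len > count - offset)
			len = count - offset;
		iov[i].iov_base = (char *)iov[i].iov_base + len;
		iov[i].iov_len -= len;
		offset += len;
	}
	return offset;
}
```

After a skip, fully consumed elements are left in place with zero length, so a subsequent `iov_fill`/`writev`-style walk over the same vector starts at the right byte.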
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, 2009-11-24 at 08:54 +1030, Rusty Russell wrote:
> #define BIG_PACKET_PAD (NET_SKB_PAD - sizeof(struct virtio_net_hdr) + \
> 			NET_IP_ALIGN)
> struct big_packet_page {
> 	struct virtio_net_hdr hdr;
> 	char pad[BIG_PACKET_PAD];
> 	/* Actual packet data starts here */
> 	unsigned char data[PAGE_SIZE - BIG_PACKET_PAD
> 			   - sizeof(struct virtio_net_hdr)];
> };

The padding was used for the qemu userspace buffer copy: skb data is
copied from the page for up to GOOD_COPY_LEN bytes, and the skb data is
reserved NET_IP_ALIGN. If we add padding as above, will the userspace
copy start at NET_SKB_PAD + NET_IP_ALIGN? I am a little bit confused
here.

> Then use this type as the template for registering the sg list for the
> big packet case.

> > > I think you can move the memcpy outside the if, like so:
> > >
> > > 	memcpy(&hdr, p, hdr_len);
> >
> > I didn't move it out, because num_buf = hdr->mhdr.num_buffers;
>
> Yes, that has to stay inside, but the memcpy can move out.  It's just
> a bit neater to have more common code.

num_buf is assigned after memcpy so we couldn't move memcpy out. Anyway
I separated mergeable buffers and big packets in the new patch to make
it clear, and it will be easy to modify when we drop big packet support
in the future.

Thanks
Shirley
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Tue, 24 Nov 2009 02:37:01 am Shirley Ma wrote:
> > > + skb = (struct sk_buff *)buf;
> > This cast is unnecessary, but a comment would be nice:
>
> Without this cast there is a compile warning.

Hi Shirley,

Looks like buf is a void *, so no cast should be necessary.  But I could
be reading the patch wrong.

> > However, I question whether making it 16 byte is the right thing: the
> > ethernet header is 14 bytes long, so don't we want 8 bytes of padding?
>
> Because QEMU requires the 10-byte header in a separate buffer, one page
> is shared between the virtio_net_hdr header, which is 10 bytes, and the
> rest of the data. So I put a 6-byte offset here between the two
> buffers. I didn't look at the reason why a separate buf is used for
> virtio_net_hdr in QEMU.

It's a qemu bug.  It insists the header be an element in the scatterlist
by itself.  Unfortunately we have to accommodate it.

However, there's no reason for the merged rxbuf and big packet layout to
be the same: for the big packet case you should layout as follows:

#define BIG_PACKET_PAD (NET_SKB_PAD - sizeof(struct virtio_net_hdr) + \
			NET_IP_ALIGN)
struct big_packet_page {
	struct virtio_net_hdr hdr;
	char pad[BIG_PACKET_PAD];
	/* Actual packet data starts here */
	unsigned char data[PAGE_SIZE - BIG_PACKET_PAD
			   - sizeof(struct virtio_net_hdr)];
};

Then use this type as the template for registering the sg list for the
big packet case.

> > I think you can move the memcpy outside the if, like so:
> >
> > 	memcpy(&hdr, p, hdr_len);
>
> I didn't move it out, because num_buf = hdr->mhdr.num_buffers;

Yes, that has to stay inside, but the memcpy can move out.  It's just a
bit neater to have more common code.

Thanks,
Rusty.
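[Editor's note: Rusty's template can be sanity-checked in userspace. The sketch below substitutes stand-in values for the kernel constants — a 4096-byte page, NET_SKB_PAD = 32 and NET_IP_ALIGN = 2 (typical x86 values), and a dummy 10-byte virtio_net_hdr as discussed in the thread — so the exact numbers are assumptions, but the layout argument holds: the template is exactly one page, and `data` lands at NET_SKB_PAD + NET_IP_ALIGN, the same offset the skb paths expect.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel definitions (assumed typical values). */
#define PAGE_SIZE    4096
#define NET_SKB_PAD  32
#define NET_IP_ALIGN 2

/* Dummy stand-in: the real header is 10 bytes of u8/u16 fields. */
struct virtio_net_hdr { unsigned char bytes[10]; };

#define BIG_PACKET_PAD \
	(NET_SKB_PAD - sizeof(struct virtio_net_hdr) + NET_IP_ALIGN)

struct big_packet_page {
	struct virtio_net_hdr hdr;
	char pad[BIG_PACKET_PAD];
	/* Actual packet data starts here */
	unsigned char data[PAGE_SIZE - BIG_PACKET_PAD
			   - sizeof(struct virtio_net_hdr)];
};
```

Since every member has alignment 1 here, the compiler inserts no padding of its own: 10 + 22 + 4064 would not hold, but 10 + BIG_PACKET_PAD (24) + data (4062) = 4096 does, and `offsetof(data)` is 34 = NET_SKB_PAD + NET_IP_ALIGN.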
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
Hello Rusty,

On Mon, 2009-11-23 at 11:38 +1030, Rusty Russell wrote:
> Overall, the patch looks good.  But it would have been nicer if it were
> split into several parts: cleanups, new infrastructure, then the actual
> allocation change.

I have split the patch into a set: cleanups, new infrastructure, and the
actual allocation change, with the add-buffer path split into
add_recvbuf_big, add_recvbuf_small, add_recvbuf_mergeable per your
suggestion, and likewise the receive path into recv_big, recv_small,
recv_mergeable. I hope you will agree with it.

I am testing the patch set now, and will submit it soon after finishing
all test cases.

Thanks
Shirley
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Mon, 2009-11-23 at 11:43 +0200, Michael S. Tsirkin wrote:
> should be !npage->private
> and nesting is too deep here:
> this is cleaner in a give_a_page subroutine
> as it was.

This will be addressed with Rusty's comment.

> > + /* use private to chain big packets */
> packets? or pages?

Will change it to chain pages for big packets.

> > + p->private = (unsigned long)0;
> the comment is not really helpful:
> you say you use private to chain but 0 does not
> chain anything. You also do not need the cast to long?

Ok.

> > + if (len > (PAGE_SIZE - f->page_offset))
> brackets around math are not needed.

OK.

> typo
>
> > +* header and data */

Got it.

> please think of a way to get rid of magic constants like 6 and 2
> here and elsewhere.

Will do.

> replace MAX_SKB_FRAGS + 2 with something symbolic?

We have it in 2 places now.

> And comment.

Ok, I can change it.

> terrible goto based loop
> move stuff into subfunction, it will be much
> more manageable, and convert this to a simple
> for loop.

Will change it to different functions based on Rusty's comment.

> and here it is MAX_SKB_FRAGS + 1

I think I should use MAX_SKB_FRAGS + 2 instead. Now I only use
MAX_SKB_FRAGS + 1 but allocated + 2.

> > + page->private = (unsigned long)first_page;
> > + first_page = page;
> > + if (--i == 1) {
> this is pretty hairy ... has to be this way?
> What you are trying to do here
> is fill buffer with pages, in a loop, with first one
> using a partial page, and then add it.
> Is that it?

Yes.

> So please code this in a straight forward manner.
> it should be as simple as:
> 	offset = XXX;
> 	for (i = 0; i < MAX_SKB_FRAGS + 2; ++i) {
> 		sg_set_buf(sg + i, p + offset, PAGE_SIZE - offset);
> 		offset = 0;
> 	}

Ok, looks more neat.

> space around +
> sg + 1 here is same as &sg[i] in fact?

Ok.

> callback -> destructor?

Ok.

I will integrate these comments with Rusty's and resubmit the patch set.

Thanks
Shirley
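[Editor's note: the loop shape MST suggests above is easy to model — only the first entry is partial (it carries the header at some offset), and every later entry covers a whole page. The sketch below is a toy userspace model, not the kernel code: `struct sg`, `fill_sg`, and `NENTS` (standing in for MAX_SKB_FRAGS + 2) are hypothetical names.]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define NENTS     4	/* stand-in for MAX_SKB_FRAGS + 2 */

/* Toy scatterlist entry: just a buffer pointer and a length. */
struct sg {
	void *buf;
	size_t len;
};

/* Fill the sg list from an array of per-entry page pointers.  The
 * first page is used from `offset` onward; resetting offset to 0
 * inside the loop makes every later page a full-page entry. */
static void fill_sg(struct sg *sg, char *pages[], size_t offset)
{
	int i;

	for (i = 0; i < NENTS; i++) {
		sg[i].buf = pages[i] + offset;
		sg[i].len = PAGE_SIZE - offset;
		offset = 0;	/* only the first entry is partial */
	}
}
```

The trick that replaces the "hairy" countdown in the reviewed patch is simply zeroing `offset` after the first iteration.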
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Mon, 2009-11-23 at 11:38 +1030, Rusty Russell wrote:
> How about:
> 	struct page *end;
>
> 	/* Find end of list, sew whole thing into vi->pages. */
> 	for (end = page; end->private; end = (struct page *)end->private);
> 	end->private = (unsigned long)vi->pages;
> 	vi->pages = page;

Yes, this looks nicer.

> > +void virtio_free_pages(void *buf)
>
> This is a terrible name; it's specific to virtio_net, plus it should be
> static.  Just "free_pages" should be sufficient here I think.
>
> This smells a lot like a for loop to me?
>
> 	struct page *i, *next;
>
> 	for (i = buf; i; i = next) {
> 		next = (struct page *)i->private;
> 		__free_pages(i, 0);
> 	}

Agree, will change it.

> > +static int set_skb_frags(struct sk_buff *skb, struct page *page,
> > +			 int offset, int len)
>
> A better name might be "skb_add_frag()"?

OK.

> > + skb = (struct sk_buff *)buf;
>
> This cast is unnecessary, but a comment would be nice:
> 	/* Simple case: the pointer is the 1514-byte skb */
>
> > + } else {

Without this cast there is a compile warning.

> And a matching comment here:
>
> 	/* The pointer is a chain of pages. */

OK.

> > + if (unlikely(!skb)) {
> > + 	dev->stats.rx_dropped++;
>
> Does this mean we leak all those pages? Shouldn't we give_pages() here?

Yes, it should call give_pages here.

> > + offset = hdr_len + 6;
>
> Really? I can't see where the current driver does this 6 byte offset.
> There should be a header, then the packet...
> Ah, I see below that you set things up this way.  The better way to do
> this is to create a new structure to use in various places.
>
> 	struct padded_vnet_hdr {
> 		struct virtio_net_hdr hdr;
> 		/* This padding makes our packet 16 byte aligned */
> 		char padding[6];
> 	};
>
> However, I question whether making it 16 byte is the right thing: the
> ethernet header is 14 bytes long, so don't we want 8 bytes of padding?

Because QEMU requires the 10-byte header in a separate buffer, one page
is shared between the virtio_net_hdr header, which is 10 bytes, and the
rest of the data. So I put a 6-byte offset here between the two buffers.
I didn't look at the reason why a separate buf is used for
virtio_net_hdr in QEMU.

> > + }
>
> I think you can move the memcpy outside the if, like so:
>
> 	memcpy(&hdr, p, hdr_len);

I didn't move it out, because num_buf = hdr->mhdr.num_buffers;

> > + if (!len)
> > + 	give_pages(vi, page);
> > + else {
> > + 	len = set_skb_frags(skb, page, copy + offset, len);
> > + 	/* process big packets */
> > + 	while (len > 0) {
> > + 		page = (struct page *)page->private;
> > + 		if (!page)
> > + 			break;
> > + 		len = set_skb_frags(skb, page, 0, len);
> > + 	}
> > + 	if (page && page->private)
> > + 		give_pages(vi, (struct page *)page->private);
> > }
>
> I can't help but think this statement should be one loop somehow.
> Something like:
>
> 	offset += copy;
>
> 	while (len) {
> 		len = skb_add_frag(skb, page, offset, len);
> 		page = (struct page *)page->private;
> 		offset = 0;
> 	}
> 	if (page)
> 		give_pages(vi, page);

I was a little bit worried about qemu messing up len (I saw code in the
mergeable buffer path checking len > PAGE_SIZE), so I put the page
checking inside. I will change it to outside if checking len is enough.

> > -static bool try_fill_recv_maxbufs(struct virtnet_info *vi, gfp_t gfp)
> > +/* Returns false if we couldn't fill entirely (OOM). */
> > +static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
>
> The result of trying to merge all the cases here is messy.  I'd split
> it inside the loop into: add_recvbuf_small(vi, gfp), add_recvbuf_big(vi,
> gfp) and add_recvbuf_mergeable(vi, gfp).
>
> It'll also be easier for me to review each case then, as well as getting
> rid of the big packets if we decide to do that later.  You could even do
> that split as a separate patch, before this one.

Ok, will separate it.

> destroy_buf should really be called destroy_bufs().  And I don't think
> you use the vi->recv skb list in this case at all, so the loop which
> purges those should be under an "else {" of this branch.

Yes.

> This new parameter should be introduced as a separate patch.  It should
> also be called destroy_bufs.  It also returns an unsigned int.  I would
> call the callback something a little more informative, such as
> "destroy"; here and in the hea
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Fri, Nov 20, 2009 at 08:21:41AM -0800, Shirley Ma wrote:
> On Fri, 2009-11-20 at 07:19 +0100, Eric Dumazet wrote:
> > Interesting use after free :)
>
> Thanks for catching the stupid mistake. This is the updated patch for
> review.
>
> Signed-off-by: Shirley Ma (x...@us.ibm.com)

Some style comments; addressing them will make it easier to review the
actual content.

> --
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b9e002f..5699bd3 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -80,33 +80,50 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
> 	return (struct skb_vnet_hdr *)skb->cb;
> }
>
> -static void give_a_page(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct virtnet_info *vi, struct page *page)
> {
> -	page->private = (unsigned long)vi->pages;
> +	struct page *npage = (struct page *)page->private;
> +
> +	if (!npage)
> +		page->private = (unsigned long)vi->pages;
> +	else {
> +		/* give a page list */
> +		while (npage) {
> +			if (npage->private == (unsigned long)0) {

should be !npage->private
and nesting is too deep here:
this is cleaner in a give_a_page subroutine
as it was.

> +				npage->private = (unsigned long)vi->pages;
> +				break;
> +			}
> +			npage = (struct page *)npage->private;
> +		}
> +	}
> 	vi->pages = page;
> }
>
> -static void trim_pages(struct virtnet_info *vi, struct sk_buff *skb)
> -{
> -	unsigned int i;
> -
> -	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> -		give_a_page(vi, skb_shinfo(skb)->frags[i].page);
> -	skb_shinfo(skb)->nr_frags = 0;
> -	skb->data_len = 0;
> -}
> -
> static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)

so in short, we are constantly walking a linked

> {
> 	struct page *p = vi->pages;
>
> -	if (p)
> +	if (p) {
> 		vi->pages = (struct page *)p->private;
> -	else
> +		/* use private to chain big packets */

packets? or pages?

> +		p->private = (unsigned long)0;

the comment is not really helpful:
you say you use private to chain but 0 does not
chain anything. You also do not need the cast to long?

> +	} else
> 		p = alloc_page(gfp_mask);
> 	return p;
> }
>
> +void virtio_free_pages(void *buf)
> +{
> +	struct page *page = (struct page *)buf;
> +	struct page *npage;
> +
> +	while (page) {
> +		npage = (struct page *)page->private;
> +		__free_pages(page, 0);
> +		page = npage;
> +	}
> +}
> +
> static void skb_xmit_done(struct virtqueue *svq)
> {
> 	struct virtnet_info *vi = svq->vdev->priv;
> @@ -118,12 +135,36 @@ static void skb_xmit_done(struct virtqueue *svq)
> 	netif_wake_queue(vi->dev);
> }
>
> -static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> +static int set_skb_frags(struct sk_buff *skb, struct page *page,
> +			 int offset, int len)
> +{
> +	int i = skb_shinfo(skb)->nr_frags;
> +	skb_frag_t *f;
> +
> +	i = skb_shinfo(skb)->nr_frags;
> +	f = &skb_shinfo(skb)->frags[i];
> +	f->page = page;
> +	f->page_offset = offset;
> +
> +	if (len > (PAGE_SIZE - f->page_offset))

brackets around math are not needed.

> +		f->size = PAGE_SIZE - f->page_offset;
> +	else
> +		f->size = len;
> +
> +	skb_shinfo(skb)->nr_frags++;
> +	skb->data_len += f->size;
> +	skb->len += f->size;
> +
> +	len -= f->size;
> +	return len;
> +}
> +
> +static void receive_skb(struct net_device *dev, void *buf,
> 			unsigned len)
> {
> 	struct virtnet_info *vi = netdev_priv(dev);
> -	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
> -	int err;
> +	struct skb_vnet_hdr *hdr;
> +	struct sk_buff *skb;
> 	int i;
>
> 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
> @@ -132,39 +173,71 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> 		goto drop;
> 	}
>
> -	if (vi->mergeable_rx_bufs) {
> -		unsigned int copy;
> -		char *p = page_address(skb_shinfo(skb)->frags[0].page);
> +	if (!vi->mergeable_rx_bufs && !vi->big_packets) {
> +		skb = (struct sk_buff *)buf;
> +
> +		__skb_unlink(skb, &vi->recv);
> +
> +		hdr = skb_vnet_hdr(skb);
> +		len -= sizeof(hdr->hdr);
> +		skb_trim(skb, len);
> +	} else {
> +		struct page *page = (struct page *)buf;
> +		int copy, hdr_len, num_buf, offset;
> +		char *p;
> +
> +		p = page_address(page);
>
> -		if (len > PAGE_SIZE)
> -			len = PAGE_SIZE;
> -		len -= sizeof(struct
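[Editor's note: the length bookkeeping at the heart of `set_skb_frags()` is simply "clamp the fragment to what fits in the rest of the page, return the bytes that still have to go onto later pages". A minimal userspace model of just that arithmetic (the name `add_frag` and the `size_t` types are the sketch's, not the driver's):]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Clamp a fragment to the space left in the page.  *size models the
 * f->size field the driver fills in; the return value is the number of
 * bytes that must be placed on subsequent pages (0 when done). */
static size_t add_frag(size_t page_offset, size_t len, size_t *size)
{
	size_t room = PAGE_SIZE - page_offset;

	*size = len < room ? len : room;
	return len - *size;
}
```

This is the invariant the caller's while-loop relies on: each iteration consumes `min(len, room)` bytes, so the loop terminates exactly when the packet has been fully attached.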
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Sat, 21 Nov 2009 02:51:41 am Shirley Ma wrote: > Signed-off-by: Shirley Ma (x...@us.ibm.com) Hi Shirley, This patch seems like a good idea, but it needs some work. As this is the first time I've received a patch from you, I will go through it in extra detail. (As virtio maintainer, it's my responsibility to include this patch, or not). > -static void give_a_page(struct virtnet_info *vi, struct page *page) > +static void give_pages(struct virtnet_info *vi, struct page *page) > { > - page->private = (unsigned long)vi->pages; > + struct page *npage = (struct page *)page->private; > + > + if (!npage) > + page->private = (unsigned long)vi->pages; > + else { > + /* give a page list */ > + while (npage) { > + if (npage->private == (unsigned long)0) { > + npage->private = (unsigned long)vi->pages; > + break; > + } > + npage = (struct page *)npage->private; > + } > + } > vi->pages = page; The loop should cover both cases of the if branch. Either way, we want to find the last page in the list so you can set that ->private pointer to vi->pages. Also, the "== (unsigned long)0" is a little verbose. How about: struct page *end; /* Find end of list, sew whole thing into vi->pages. */ for (end = page; end->private; end = (struct page *)end->private); end->private = (unsigned long)vi->pages; vi->pages = page; > +void virtio_free_pages(void *buf) This is a terrible name; it's specific to virtio_net, plus it should be static. Just "free_pages" should be sufficient here I think. > +{ > + struct page *page = (struct page *)buf; > + struct page *npage; > + > + while (page) { > + npage = (struct page *)page->private; > + __free_pages(page, 0); > + page = npage; > + } This smells a lot like a for loop to me? struct page *i, *next; for (i = buf; i; i = next) { next = (struct page *)i->private; __free_pages(i, 0); } > +static int set_skb_frags(struct sk_buff *skb, struct page *page, > + int offset, int len) A better name might be "skb_add_frag()"?
> +static void receive_skb(struct net_device *dev, void *buf, > unsigned len) > { > struct virtnet_info *vi = netdev_priv(dev); > - struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb); > - int err; > + struct skb_vnet_hdr *hdr; > + struct sk_buff *skb; > int i; > > if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) { > @@ -132,39 +173,71 @@ static void receive_skb(struct net_device *dev, struct > sk_buff *skb, > goto drop; > } > > - if (vi->mergeable_rx_bufs) { > - unsigned int copy; > - char *p = page_address(skb_shinfo(skb)->frags[0].page); > + if (!vi->mergeable_rx_bufs && !vi->big_packets) { > + skb = (struct sk_buff *)buf; This cast is unnecessary, but a comment would be nice: /* Simple case: the pointer is the 1514-byte skb */ > + } else { And a matching comment here: /* The pointer is a chain of pages. */ > + struct page *page = (struct page *)buf; > + int copy, hdr_len, num_buf, offset; > + char *p; > + > + p = page_address(page); > > - if (len > PAGE_SIZE) > - len = PAGE_SIZE; > - len -= sizeof(struct virtio_net_hdr_mrg_rxbuf); > + skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN); > + if (unlikely(!skb)) { > + dev->stats.rx_dropped++; Does this mean we leak all those pages? Shouldn't we give_pages() here? > + return; > + } > + skb_reserve(skb, NET_IP_ALIGN); > + hdr = skb_vnet_hdr(skb); > > - memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr)); > - p += sizeof(hdr->mhdr); > + if (vi->mergeable_rx_bufs) { > + hdr_len = sizeof(hdr->mhdr); > + memcpy(&hdr->mhdr, p, hdr_len); > + num_buf = hdr->mhdr.num_buffers; > + offset = hdr_len; > + if (len > PAGE_SIZE) > + len = PAGE_SIZE; > + } else { > + /* big packets: 6 byte alignment between virtio_net > + * header and data */ > + hdr_len = sizeof(hdr->hdr); > + memcpy(&hdr->hdr, p, hdr_len); > + offset = hdr_len + 6; Really? I can't see where the current driver does this 6 byte offset. There should be a header, then the packet... Ah, I see below that you set things up this way.
The better way to do this is to create a new
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Fri, 2009-11-20 at 07:19 +0100, Eric Dumazet wrote: > Interesting use after free :) Thanks for catching the stupid mistake. This is the updated patch for review. Signed-off-by: Shirley Ma (x...@us.ibm.com) -- diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index b9e002f..5699bd3 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -80,33 +80,50 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb) return (struct skb_vnet_hdr *)skb->cb; } -static void give_a_page(struct virtnet_info *vi, struct page *page) +static void give_pages(struct virtnet_info *vi, struct page *page) { - page->private = (unsigned long)vi->pages; + struct page *npage = (struct page *)page->private; + + if (!npage) + page->private = (unsigned long)vi->pages; + else { + /* give a page list */ + while (npage) { + if (npage->private == (unsigned long)0) { + npage->private = (unsigned long)vi->pages; + break; + } + npage = (struct page *)npage->private; + } + } vi->pages = page; } -static void trim_pages(struct virtnet_info *vi, struct sk_buff *skb) -{ - unsigned int i; - - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) - give_a_page(vi, skb_shinfo(skb)->frags[i].page); - skb_shinfo(skb)->nr_frags = 0; - skb->data_len = 0; -} - static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask) { struct page *p = vi->pages; - if (p) + if (p) { vi->pages = (struct page *)p->private; - else + /* use private to chain big packets */ + p->private = (unsigned long)0; + } else p = alloc_page(gfp_mask); return p; } +void virtio_free_pages(void *buf) +{ + struct page *page = (struct page *)buf; + struct page *npage; + + while (page) { + npage = (struct page *)page->private; + __free_pages(page, 0); + page = npage; + } +} + static void skb_xmit_done(struct virtqueue *svq) { struct virtnet_info *vi = svq->vdev->priv; @@ -118,12 +135,36 @@ static void skb_xmit_done(struct virtqueue *svq) netif_wake_queue(vi->dev); } -static void receive_skb(struct 
net_device *dev, struct sk_buff *skb, +static int set_skb_frags(struct sk_buff *skb, struct page *page, + int offset, int len) +{ + int i = skb_shinfo(skb)->nr_frags; + skb_frag_t *f; + + i = skb_shinfo(skb)->nr_frags; + f = &skb_shinfo(skb)->frags[i]; + f->page = page; + f->page_offset = offset; + + if (len > (PAGE_SIZE - f->page_offset)) + f->size = PAGE_SIZE - f->page_offset; + else + f->size = len; + + skb_shinfo(skb)->nr_frags++; + skb->data_len += f->size; + skb->len += f->size; + + len -= f->size; + return len; +} + +static void receive_skb(struct net_device *dev, void *buf, unsigned len) { struct virtnet_info *vi = netdev_priv(dev); - struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb); - int err; + struct skb_vnet_hdr *hdr; + struct sk_buff *skb; int i; if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) { @@ -132,39 +173,71 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb, goto drop; } - if (vi->mergeable_rx_bufs) { - unsigned int copy; - char *p = page_address(skb_shinfo(skb)->frags[0].page); + if (!vi->mergeable_rx_bufs && !vi->big_packets) { + skb = (struct sk_buff *)buf; + + __skb_unlink(skb, &vi->recv); + + hdr = skb_vnet_hdr(skb); + len -= sizeof(hdr->hdr); + skb_trim(skb, len); + } else { + struct page *page = (struct page *)buf; + int copy, hdr_len, num_buf, offset; + char *p; + + p = page_address(page); - if (len > PAGE_SIZE) - len = PAGE_SIZE; - len -= sizeof(struct virtio_net_hdr_mrg_rxbuf); + skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN); + if (unlikely(!skb)) { + dev->stats.rx_dropped++; + return; + } + skb_reserve(skb, NET_IP_ALIGN); + hdr = skb_vnet_hdr(skb); - memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr)); - p += sizeof(hdr->mhdr); + if (vi->mergeable_rx_bufs) { + hdr_len = sizeof(hdr->mhdr); + memcpy(&hdr->mhdr, p, hdr_len); + num_buf = hdr->mhdr.n
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
On Fri, 2009-11-20 at 07:19 +0100, Eric Dumazet wrote: > > +void virtio_free_pages(void *buf) > > +{ > > + struct page *page = (struct page *)buf; > > + > > + while (page) { > > + __free_pages(page, 0); > > + page = (struct page *)page->private; > > + } > > +} > > + > > Interesting use after free :) Good catch. This code runs on virtio_net removal. I ran virtio_net removal many times and didn't hit any panic :(. Fixed it as below:

void virtio_free_pages(void *buf)
{
	struct page *page = (struct page *)buf;
	struct page *npage;

	while (page) {
		npage = (struct page *)page->private;
		__free_pages(page, 0);
		page = npage;
	}
}

Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
Shirley Ma wrote: > This patch is generated against 2.6 git tree. I didn't break up this > patch since it has one functionality. Please review it. > > Thanks > Shirley > > Signed-off-by: Shirley Ma > -- > > +void virtio_free_pages(void *buf) > +{ > + struct page *page = (struct page *)buf; > + > + while (page) { > + __free_pages(page, 0); > + page = (struct page *)page->private; > + } > +} > + Interesting use after free :)
[PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
This patch is generated against 2.6 git tree. I didn't break up this patch since it has one functionality. Please review it. Thanks Shirley Signed-off-by: Shirley Ma -- diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index b9e002f..6fb788b 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -80,33 +80,48 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb) return (struct skb_vnet_hdr *)skb->cb; } -static void give_a_page(struct virtnet_info *vi, struct page *page) +static void give_pages(struct virtnet_info *vi, struct page *page) { - page->private = (unsigned long)vi->pages; + struct page *npage = (struct page *)page->private; + + if (!npage) + page->private = (unsigned long)vi->pages; + else { + /* give a page list */ + while (npage) { + if (npage->private == (unsigned long)0) { + npage->private = (unsigned long)vi->pages; + break; + } + npage = (struct page *)npage->private; + } + } vi->pages = page; } -static void trim_pages(struct virtnet_info *vi, struct sk_buff *skb) -{ - unsigned int i; - - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) - give_a_page(vi, skb_shinfo(skb)->frags[i].page); - skb_shinfo(skb)->nr_frags = 0; - skb->data_len = 0; -} - static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask) { struct page *p = vi->pages; - if (p) + if (p) { vi->pages = (struct page *)p->private; - else + /* use private to chain big packets */ + p->private = (unsigned long)0; + } else p = alloc_page(gfp_mask); return p; } +void virtio_free_pages(void *buf) +{ + struct page *page = (struct page *)buf; + + while (page) { + __free_pages(page, 0); + page = (struct page *)page->private; + } +} + static void skb_xmit_done(struct virtqueue *svq) { struct virtnet_info *vi = svq->vdev->priv; @@ -118,12 +133,36 @@ static void skb_xmit_done(struct virtqueue *svq) netif_wake_queue(vi->dev); } -static void receive_skb(struct net_device *dev, struct sk_buff *skb, +static int set_skb_frags(struct sk_buff *skb, 
struct page *page, + int offset, int len) +{ + int i = skb_shinfo(skb)->nr_frags; + skb_frag_t *f; + + i = skb_shinfo(skb)->nr_frags; + f = &skb_shinfo(skb)->frags[i]; + f->page = page; + f->page_offset = offset; + + if (len > (PAGE_SIZE - f->page_offset)) + f->size = PAGE_SIZE - f->page_offset; + else + f->size = len; + + skb_shinfo(skb)->nr_frags++; + skb->data_len += f->size; + skb->len += f->size; + + len -= f->size; + return len; +} + +static void receive_skb(struct net_device *dev, void *buf, unsigned len) { struct virtnet_info *vi = netdev_priv(dev); - struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb); - int err; + struct skb_vnet_hdr *hdr; + struct sk_buff *skb; int i; if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) { @@ -132,39 +171,71 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb, goto drop; } - if (vi->mergeable_rx_bufs) { - unsigned int copy; - char *p = page_address(skb_shinfo(skb)->frags[0].page); + if (!vi->mergeable_rx_bufs && !vi->big_packets) { + skb = (struct sk_buff *)buf; + + __skb_unlink(skb, &vi->recv); + + hdr = skb_vnet_hdr(skb); + len -= sizeof(hdr->hdr); + skb_trim(skb, len); + } else { + struct page *page = (struct page *)buf; + int copy, hdr_len, num_buf, offset; + char *p; + + p = page_address(page); - if (len > PAGE_SIZE) - len = PAGE_SIZE; - len -= sizeof(struct virtio_net_hdr_mrg_rxbuf); + skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN); + if (unlikely(!skb)) { + dev->stats.rx_dropped++; + return; + } + skb_reserve(skb, NET_IP_ALIGN); + hdr = skb_vnet_hdr(skb); - memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr)); - p += sizeof(hdr->mhdr); + if (vi->mergeable_rx_bufs) { + hdr_len = sizeof(hdr->mhdr); + memcpy(&hdr->mhdr, p, hdr_len); + num_buf = hdr->mhdr.num_buffers; + offset = hdr_len; + if (len > PAGE_SIZE) +