RE: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-29 Thread Xin, Xiaohui
>-Original Message-
>From: Ben Hutchings [mailto:bhutchi...@solarflare.com]
>Sent: Tuesday, September 28, 2010 5:24 AM
>To: Xin, Xiaohui
>Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
>mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
>jd...@linux.intel.com
>Subject: Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
>
>On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
>> From: Xin Xiaohui 
>>
>> The patch add mp(mediate passthru) device, which now
>> based on vhost-net backend driver and provides proto_ops
>> to send/receive guest buffers data from/to guest vitio-net
>> driver.
>>
>> Signed-off-by: Xin Xiaohui 
>> Signed-off-by: Zhao Yu 
>> Reviewed-by: Jeff Dike 
>> ---
>>  drivers/vhost/mpassthru.c | 1407
>+
>>  1 files changed, 1407 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/vhost/mpassthru.c
>>
>> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
>> new file mode 100644
>> index 000..d86d94c
>> --- /dev/null
>> +++ b/drivers/vhost/mpassthru.c
>[...]
>> +/* #define MPASSTHRU_DEBUG 1 */
>> +
>> +#ifdef MPASSTHRU_DEBUG
>> +static int debug;
>> +
>> +#define DBG  if (mp->debug) printk
>> +#define DBG1 if (debug == 2) printk
>
>This is unsafe; consider this usage:
>
>   if (foo)
>   DBG("bar\n");
>   else
>   baz();
>
>You should use the standard pr_debug() or dev_dbg() instead.
>
Ok. Move to pr_debug().

>[...]
>> +struct page_ctor {
>> +   struct list_headreadq;
>> +   int wq_len;
>> +   int rq_len;
>> +   spinlock_t  read_lock;
>
>Only one queue?!
>
This readq is for mp device to store the rx guest buffers coming from vhost.
For tx side, we don't need that, since tx buffers are sent out to device 
immediately.

>I would have appreciated some introductory comments on these structures.
>I still don't have any sort of clear picture of how this is all supposed
>to work.
>
Basically, struct page_info{} is used to store each info of the guest buffers
and some of vhost ring descriptor info according to this buffers.
struct page_ctor{} is used to manipulate multiple struct page_info{}.
Each device can do zero-copy is according one struct page_ctor{}.

Anyway, I will comments on fields of these structures.

>[...]
>> +/* The main function to allocate external buffers */
>> +static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
>> +struct sk_buff *skb, int npages)
>> +{
>> +int i;
>> +unsigned long flags;
>> +struct page_ctor *ctor;
>> +struct page_info *info = NULL;
>> +
>> +ctor = container_of(port, struct page_ctor, port);
>> +
>> +spin_lock_irqsave(&ctor->read_lock, flags);
>> +if (!list_empty(&ctor->readq)) {
>> +info = list_first_entry(&ctor->readq, struct page_info, list);
>> +list_del(&info->list);
>> +ctor->rq_len--;
>> +}
>> +spin_unlock_irqrestore(&ctor->read_lock, flags);
>> +if (!info)
>> +return NULL;
>> +
>> +for (i = 0; i < info->pnum; i++)
>> +get_page(info->pages[i]);
>> +info->skb = skb;
>> +return &info->ext_page;
>> +}
>
>Why isn't the npages parameter used?
>

Sorry, somehow forgot that, currently it only allocates one page a time,
we want to use npages to see if we break the limit.

>[...]
>> +static void relinquish_resource(struct page_ctor *ctor)
>> +{
>> +if (!(ctor->dev->flags & IFF_UP) &&
>> +!(ctor->wq_len + ctor->rq_len))
>> +printk(KERN_INFO "relinquish_resource\n");
>> +}
>
>Looks like something's missing here.
>

This function is for debug, and so on the rq_len stuff, I'll remove them later.

>> +static void mp_ki_dtor(struct kiocb *iocb)
>> +{
>> +struct page_info *info = (struct page_info *)(iocb->private);
>> +int i;
>> +
>> +if (info->flags == INFO_READ) {
>> +for (i = 0; i < info->pnum; i++) {
>> +if (info->pages[i]) {
>> +set_page_dirty_lock(info->pages[i]);
>> +put_page(info->pages[i]);
>> +}
>> + 

Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-28 Thread Sridhar Samudrala

 On 9/28/2010 8:18 AM, Arnd Bergmann wrote:

On Tuesday 28 September 2010, Michael S. Tsirkin wrote:

On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote:

Can you be more specific what the problem is? Do you think
it breaks when a guest sends VLAN tagged frames or when macvtap
is connected to a VLAN interface that adds another tag (or
only the combination)?

I expect the protocol value to be wrong when guest sends vlan tagged
frames as 802.1q frames have a different format.

Ok, I see. Would that be fixed by using eth_type_trans()? I don't
see any code in there that tries to deal with the VLAN tag, so
do we have the same problem in the tun/tap driver?
tun_get_user() does call eth_type_trans(). Not sure why i didn't use it 
in macvtap code.

Need to test it with guest VLAN tagging to make sure it works.

Also, I wonder how we handle the case where both the guest and
the host do VLAN tagging. Does the host transparently override
the guest tag, or does it add a nested tag? More importantly,
what should it do?

I would think If both guest and host do VLAN tagging, the tags will be 
nested.


Thanks
Sridhar



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-28 Thread Arnd Bergmann
On Tuesday 28 September 2010, Michael S. Tsirkin wrote:
> On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote:
> > Can you be more specific what the problem is? Do you think
> > it breaks when a guest sends VLAN tagged frames or when macvtap
> > is connected to a VLAN interface that adds another tag (or
> > only the combination)?
>
> I expect the protocol value to be wrong when guest sends vlan tagged
> frames as 802.1q frames have a different format.

Ok, I see. Would that be fixed by using eth_type_trans()? I don't
see any code in there that tries to deal with the VLAN tag, so
do we have the same problem in the tun/tap driver?

Also, I wonder how we handle the case where both the guest and
the host do VLAN tagging. Does the host transparently override
the guest tag, or does it add a nested tag? More importantly,
what should it do?

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-28 Thread Michael S. Tsirkin
On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote:
> On Tuesday 28 September 2010, Michael S. Tsirkin wrote:
> > > > +   skb_reserve(skb, NET_IP_ALIGN);
> > > > +   skb_put(skb, len);
> > > > +
> > > > +   if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
> > > > +   kfree_skb(skb);
> > > > +   return -EAGAIN;
> > > > +   }
> > > > +
> > > > +   skb->protocol = eth_type_trans(skb, mp->dev);
> > > 
> > > Why are you calling eth_type_trans() on transmit?
> > 
> > So that GSO can work.  BTW macvtap does:
> > 
> > skb_set_network_header(skb, ETH_HLEN);
> > skb_reset_mac_header(skb);
> > skb->protocol = eth_hdr(skb)->h_proto;
> > 
> > and I think this is broken for vlans. Arnd?
> 
> Hmm, that code (besides set_network_header) was added by Sridhar
> for GSO support. I believe I originally did eth_type_trans but
> had to change it before that time because it broke something.
> Unfortunately, my memory on that is not very good any more.
> 
> Can you be more specific what the problem is? Do you think
> it breaks when a guest sends VLAN tagged frames or when macvtap
> is connected to a VLAN interface that adds another tag (or
> only the combination)?
> 
>   Arnd

I expect the protocol value to be wrong when guest sends vlan tagged
frames as 802.1q frames have a different format.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-28 Thread Arnd Bergmann
On Tuesday 28 September 2010, Michael S. Tsirkin wrote:
> > > +   skb_reserve(skb, NET_IP_ALIGN);
> > > +   skb_put(skb, len);
> > > +
> > > +   if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
> > > +   kfree_skb(skb);
> > > +   return -EAGAIN;
> > > +   }
> > > +
> > > +   skb->protocol = eth_type_trans(skb, mp->dev);
> > 
> > Why are you calling eth_type_trans() on transmit?
> 
> So that GSO can work.  BTW macvtap does:
> 
> skb_set_network_header(skb, ETH_HLEN);
> skb_reset_mac_header(skb);
> skb->protocol = eth_hdr(skb)->h_proto;
> 
> and I think this is broken for vlans. Arnd?

Hmm, that code (besides set_network_header) was added by Sridhar
for GSO support. I believe I originally did eth_type_trans but
had to change it before that time because it broke something.
Unfortunately, my memory on that is not very good any more.

Can you be more specific what the problem is? Do you think
it breaks when a guest sends VLAN tagged frames or when macvtap
is connected to a VLAN interface that adds another tag (or
only the combination)?

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-28 Thread Michael S. Tsirkin
On Mon, Sep 27, 2010 at 10:23:33PM +0100, Ben Hutchings wrote:
> > +/* The main function to transform the guest user space address
> > + * to host kernel address via get_user_pages(). Thus the hardware
> > + * can do DMA directly to the external buffer address.
> > + */
> > +static struct page_info *alloc_page_info(struct page_ctor *ctor,
> > +   struct kiocb *iocb, struct iovec *iov,
> > +   int count, struct frag *frags,
> > +   int npages, int total)
> > +{
> > +   int rc;
> > +   int i, j, n = 0;
> > +   int len;
> > +   unsigned long base, lock_limit;
> > +   struct page_info *info = NULL;
> > +
> > +   lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> > +   lock_limit >>= PAGE_SHIFT;
> > +
> > +   if (ctor->lock_pages + count > lock_limit && npages) {
> > +   printk(KERN_INFO "exceed the locked memory rlimit.");
> > +   return NULL;
> > +   }
> 
> What if the process is locking pages with mlock() as well?  Doesn't this
> allow it to lock twice as many pages as it should be able to?

No, since locked_vm is incremented by both mp and mlock.
Or at least, that's the idea :)
In any case, twice as many would not be a big deal: admin can control
which processes can do this by controlling access to the device.

> > +   info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
> > +   
> > +   if (!info)
> > +   return NULL;
> > +   info->skb = NULL;
> > +   info->next = info->prev = NULL;
> > +
> > +   for (i = j = 0; i < count; i++) {
> > +   base = (unsigned long)iov[i].iov_base;
> > +   len = iov[i].iov_len;
> > +
> > +   if (!len)
> > +   continue;
> > +   n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> > +
> > +   rc = get_user_pages_fast(base, n, npages ? 1 : 0,
> > +   &info->pages[j]);
> > +   if (rc != n)
> > +   goto failed;
> > +
> > +   while (n--) {
> > +   frags[j].offset = base & ~PAGE_MASK;
> > +   frags[j].size = min_t(int, len,
> > +   PAGE_SIZE - frags[j].offset);
> > +   len -= frags[j].size;
> > +   base += frags[j].size;
> > +   j++;
> > +   }
> > +   }
> > +
> > +#ifdef CONFIG_HIGHMEM
> > +   if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
> > +   for (i = 0; i < j; i++) {
> > +   if (PageHighMem(info->pages[i]))
> > +   goto failed;
> > +   }
> > +   }
> > +#endif
> 
> Shouldn't you try to allocate lowmem pages explicitly, rather than
> failing at this point?

We don't allocate pages, we lock given pages. Once this is
integrated in macvtap presumably we'll fall back on data copy
for such devices.

...

> > +   skb_reserve(skb, NET_IP_ALIGN);
> > +   skb_put(skb, len);
> > +
> > +   if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
> > +   kfree_skb(skb);
> > +   return -EAGAIN;
> > +   }
> > +
> > +   skb->protocol = eth_type_trans(skb, mp->dev);
> 
> Why are you calling eth_type_trans() on transmit?

So that GSO can work.  BTW macvtap does:

skb_set_network_header(skb, ETH_HLEN);
skb_reset_mac_header(skb);
skb->protocol = eth_hdr(skb)->h_proto;

and I think this is broken for vlans. Arnd?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-27 Thread Ben Hutchings
On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
> From: Xin Xiaohui 
> 
> The patch add mp(mediate passthru) device, which now
> based on vhost-net backend driver and provides proto_ops
> to send/receive guest buffers data from/to guest vitio-net
> driver.
> 
> Signed-off-by: Xin Xiaohui 
> Signed-off-by: Zhao Yu 
> Reviewed-by: Jeff Dike 
> ---
>  drivers/vhost/mpassthru.c | 1407 
> +
>  1 files changed, 1407 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/mpassthru.c
> 
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 000..d86d94c
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
[...]
> +/* #define MPASSTHRU_DEBUG 1 */
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk

This is unsafe; consider this usage:

if (foo)
DBG("bar\n");
else
baz();

You should use the standard pr_debug() or dev_dbg() instead.

[...]
> +struct page_ctor {
> +   struct list_headreadq;
> +   int wq_len;
> +   int rq_len;
> +   spinlock_t  read_lock;

Only one queue?!

I would have appreciated some introductory comments on these structures.
I still don't have any sort of clear picture of how this is all supposed
to work.

[...]
> +/* The main function to allocate external buffers */
> +static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
> + struct sk_buff *skb, int npages)
> +{
> + int i;
> + unsigned long flags;
> + struct page_ctor *ctor;
> + struct page_info *info = NULL;
> +
> + ctor = container_of(port, struct page_ctor, port);
> +
> + spin_lock_irqsave(&ctor->read_lock, flags);
> + if (!list_empty(&ctor->readq)) {
> + info = list_first_entry(&ctor->readq, struct page_info, list);
> + list_del(&info->list);
> + ctor->rq_len--;
> + }
> + spin_unlock_irqrestore(&ctor->read_lock, flags);
> + if (!info)
> + return NULL;
> +
> + for (i = 0; i < info->pnum; i++)
> + get_page(info->pages[i]);
> + info->skb = skb;
> + return &info->ext_page;
> +}

Why isn't the npages parameter used?

[...]
> +static void relinquish_resource(struct page_ctor *ctor)
> +{
> + if (!(ctor->dev->flags & IFF_UP) &&
> + !(ctor->wq_len + ctor->rq_len))
> + printk(KERN_INFO "relinquish_resource\n");
> +}

Looks like something's missing here.

> +static void mp_ki_dtor(struct kiocb *iocb)
> +{
> + struct page_info *info = (struct page_info *)(iocb->private);
> + int i;
> +
> + if (info->flags == INFO_READ) {
> + for (i = 0; i < info->pnum; i++) {
> + if (info->pages[i]) {
> + set_page_dirty_lock(info->pages[i]);
> + put_page(info->pages[i]);
> + }
> + }
> + mp_hash_delete(info->ctor, info);
> + if (info->skb) {
> + info->skb->destructor = NULL;
> + kfree_skb(info->skb);
> + }
> + info->ctor->rq_len--;

Doesn't rq_len represent the number of buffers queued between the guest
and the driver?  It is already decremented in page_ctor() so it seems
like it gets decremented twice for each buffer.  Also don't you need to
hold the read_lock when updating rq_len?

> + } else
> + info->ctor->wq_len--;

Maybe you should define rq_len and wq_len both as atomic_t.

[...]
> +static void __mp_detach(struct mp_struct *mp)
> +{
> + mp->mfile = NULL;
> +
> + mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> + page_ctor_detach(mp);
> + mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);

This is racy; you should hold rtnl_lock over all these changes.

[...]
> +typedef u32 key_mp_t;
> +static inline key_mp_t mp_hash(struct page *page, int buckets)
> +{
> + key_mp_t k;
> +
> + k = unsigned long)page << 32UL) >> 32UL) / 0x38) % buckets ;
> + return k;
> +}

This is never going to work on a 32-bit machine, and what is the purpose
of the magic number 0x38?

Try using hash_ptr() from .

> +static struct page_info *mp_hash_delete(struct page_ctor *ctor,
> + struct page_info *info)
> +{
> + key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
> + struct page_info *tmp = NULL;
> + int i;
> +
> + tmp = ctor->hash_table[key];
> + while (tmp) {
> + if (tmp == info) {
> + if (!tmp->prev) {
> + ctor->hash_table[key] = tmp->next;
> + if (tmp->next)
> + tmp->next->prev = NULL;
> + } else {
> + t