[dpdk-dev] vhost compliant virtio based networking interface in container

2015-09-14 Thread Xie, Huawei
On 9/8/2015 12:45 PM, Tetsuya Mukawa wrote:
> On 2015/09/07 14:54, Xie, Huawei wrote:
>> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>>> On 2015/08/25 18:56, Xie, Huawei wrote:
 On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
> Hi Xie and Yanping,
>
>
> May I ask you some questions?
> It seems we are also developing almost the same thing.
 Good to know that we are tackling the same problem and have a similar
 idea.
 What is your status now? We had the POC running, and it is compliant with
 DPDK vhost.
 Interrupt-like notification isn't supported.
>>> We implemented the vhost PMD first, so we have only just started implementing it.
>>>
> On 2015/08/20 19:14, Xie, Huawei wrote:
>> Added dev at dpdk.org
>>
>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>> Yanping:
>>> I read your mail; it seems what we did is quite similar. Here I wrote a
>>> quick mail to describe our design. Let me know if it is the same thing.
>>>
>>> Problem Statement:
>>> We don't have a high-performance networking interface in containers for
>>> NFV. The current veth-pair-based interface cannot easily be accelerated.
>>>
>>> The key components involved:
>>> 1. DPDK-based virtio PMD driver in the container.
>>> 2. Device simulation framework in the container.
>>> 3. DPDK (or kernel) vhost running in the host.
>>>
>>> How is virtio created?
>>> A: There is no "real" virtio-pci device in the container environment.
>>> 1) The host maintains pools of memory and shares memory with the container.
>>> This could be accomplished by having the host share a huge page file with
>>> the container.
>>> 2) The container creates virtio rings on the shared memory.
>>> 3) The container creates mbuf memory pools on the shared memory.
>>> 4) The container sends the memory and vring information to vhost through
>>> vhost messages. This could be done either through an ioctl call or a
>>> vhost-user message.
>>>
>>> How is the vhost message sent?
>>> A: There are two alternative ways to do this.
>>> 1) The customized virtio PMD is responsible for all the vring creation
>>> and vhost message sending.
> Above is our approach so far.
> It seems Yanping also takes this kind of approach.
> We are using vhost-user functionality instead of using the vhost-net
> kernel module.
> Probably this is the difference between Yanping and us.
 In my current implementation, the device simulation layer talks to "user
 space" vhost through the cuse interface. It could also be done through the
 vhost-user socket. This isn't the key point.
 Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
 accurate, covering either cuse or a unix domain socket. :)

 As for Yanping, they are now connecting to the vhost-net kernel module, but
 they are also trying to connect to "user space" vhost. Correct me if I am
 wrong.
 Yes, there is some difference between these two. The vhost-net kernel module
 can directly access another process's memory, while with
 vhost-user (cuse/socket) we need to do the memory mapping.
> BTW, we are going to submit a vhost PMD for DPDK 2.2.
> This PMD is implemented on top of librte_vhost.
> It allows a DPDK application to handle a vhost-user (or cuse) backend as a
> normal NIC port.
> This PMD should work with both Xie's and Yanping's approaches.
> (In the case of Yanping's approach, we may need vhost-cuse.)
>
>>> 2) We could do this through a lightweight device simulation framework.
>>> The device simulation creates a simple PCI bus. On the PCI bus,
>>> virtio-net PCI devices are created. The device simulation provides an
>>> IOAPI for MMIO/IO access.
> Does it mean you implemented a kernel module?
> If so, do you still need the vhost-cuse functionality to handle vhost
> messages in userspace?
 The device simulation is a library running in user space in the container.
 It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
 devices.
 The virtio-container-PMD configures the virtio-net pseudo devices
 through the IOAPI provided by the device simulation rather than through IO
 instructions as in a KVM guest.
 Why do we use device simulation?
 Because we could create other virtio devices in the container, and it
 provides a common way to talk to the vhost-xx modules.
>>> Thanks for the explanation.
>>> At first reading, I thought the difference between approach 1 and
>>> approach 2 was whether we need to implement a new kernel module or not.
>>> But now I understand how you implemented it.
>>>
>>> Please let me explain our design in more detail.
>>> We might use a somewhat similar approach to handle a pseudo virtio-net
>>> device in DPDK.
>>> (Anyway, we haven't finished implementing it yet, so this overview might
>>> have some technical problems.)
>>>
>>> Step 1. Separate the virtio-net and vhost-user socket related code from
>>> QEMU, then implement it as a separate program.
>>> The program also has 

[dpdk-dev] vhost compliant virtio based networking interface in container

2015-09-08 Thread Tetsuya Mukawa
On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
 Hi Xie and Yanping,


 May I ask you some questions?
 It seems we are also developing almost the same thing.
>>> Good to know that we are tackling the same problem and have a similar
>>> idea.
>>> What is your status now? We had the POC running, and it is compliant with
>>> DPDK vhost.
>>> Interrupt-like notification isn't supported.
>> We implemented the vhost PMD first, so we have only just started implementing it.
>>
 On 2015/08/20 19:14, Xie, Huawei wrote:
> Added dev at dpdk.org
>
> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>> Yanping:
>> I read your mail; it seems what we did is quite similar. Here I wrote a
>> quick mail to describe our design. Let me know if it is the same thing.
>>
>> Problem Statement:
>> We don't have a high-performance networking interface in containers for
>> NFV. The current veth-pair-based interface cannot easily be accelerated.
>>
>> The key components involved:
>> 1. DPDK-based virtio PMD driver in the container.
>> 2. Device simulation framework in the container.
>> 3. DPDK (or kernel) vhost running in the host.
>>
>> How is virtio created?
>> A: There is no "real" virtio-pci device in the container environment.
>> 1) The host maintains pools of memory and shares memory with the container.
>> This could be accomplished by having the host share a huge page file with
>> the container.
>> 2) The container creates virtio rings on the shared memory.
>> 3) The container creates mbuf memory pools on the shared memory.
>> 4) The container sends the memory and vring information to vhost through
>> vhost messages. This could be done either through an ioctl call or a
>> vhost-user message.
>>
>> How is the vhost message sent?
>> A: There are two alternative ways to do this.
>> 1) The customized virtio PMD is responsible for all the vring creation
>> and vhost message sending.
 Above is our approach so far.
 It seems Yanping also takes this kind of approach.
 We are using vhost-user functionality instead of using the vhost-net
 kernel module.
 Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to "user
>>> space" vhost through the cuse interface. It could also be done through the
>>> vhost-user socket. This isn't the key point.
>>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
>>> accurate, covering either cuse or a unix domain socket. :)
>>>
>>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>>> they are also trying to connect to "user space" vhost. Correct me if I am wrong.
>>> Yes, there is some difference between these two. The vhost-net kernel module
>>> can directly access another process's memory, while with
>>> vhost-user (cuse/socket) we need to do the memory mapping.
 BTW, we are going to submit a vhost PMD for DPDK 2.2.
 This PMD is implemented on top of librte_vhost.
 It allows a DPDK application to handle a vhost-user (or cuse) backend as a
 normal NIC port.
 This PMD should work with both Xie's and Yanping's approaches.
 (In the case of Yanping's approach, we may need vhost-cuse.)

>> 2) We could do this through a lightweight device simulation framework.
>> The device simulation creates a simple PCI bus. On the PCI bus,
>> virtio-net PCI devices are created. The device simulation provides an
>> IOAPI for MMIO/IO access.
 Does it mean you implemented a kernel module?
 If so, do you still need the vhost-cuse functionality to handle vhost
 messages in userspace?
>>> The device simulation is a library running in user space in the container.
>>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>>> devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>> through the IOAPI provided by the device simulation rather than through IO
>>> instructions as in a KVM guest.
>>> Why do we use device simulation?
>>> Because we could create other virtio devices in the container, and it
>>> provides a common way to talk to the vhost-xx modules.
>> Thanks for the explanation.
>> At first reading, I thought the difference between approach 1 and
>> approach 2 was whether we need to implement a new kernel module or not.
>> But now I understand how you implemented it.
>>
>> Please let me explain our design in more detail.
>> We might use a somewhat similar approach to handle a pseudo virtio-net
>> device in DPDK.
>> (Anyway, we haven't finished implementing it yet, so this overview might
>> have some technical problems.)
>>
>> Step 1. Separate the virtio-net and vhost-user socket related code from
>> QEMU, then implement it as a separate program.
>> The program also has the features below.
>>  - Create a directory that contains almost the same files as
>> /sys/bus/pci/device//*
>>(To scan these files located outside 

[dpdk-dev] vhost compliant virtio based networking interface in container

2015-09-07 Thread Xie, Huawei
On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
> On 2015/08/25 18:56, Xie, Huawei wrote:
>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>> Hi Xie and Yanping,
>>>
>>>
>>> May I ask you some questions?
>>> It seems we are also developing almost the same thing.
>> Good to know that we are tackling the same problem and have a similar
>> idea.
>> What is your status now? We had the POC running, and it is compliant with
>> DPDK vhost.
>> Interrupt-like notification isn't supported.
> We implemented the vhost PMD first, so we have only just started implementing it.
>
>>> On 2015/08/20 19:14, Xie, Huawei wrote:
 Added dev at dpdk.org

 On 8/20/2015 6:04 PM, Xie, Huawei wrote:
> Yanping:
> I read your mail; it seems what we did is quite similar. Here I wrote a
> quick mail to describe our design. Let me know if it is the same thing.
>
> Problem Statement:
> We don't have a high-performance networking interface in containers for
> NFV. The current veth-pair-based interface cannot easily be accelerated.
>
> The key components involved:
> 1. DPDK-based virtio PMD driver in the container.
> 2. Device simulation framework in the container.
> 3. DPDK (or kernel) vhost running in the host.
>
> How is virtio created?
> A: There is no "real" virtio-pci device in the container environment.
> 1) The host maintains pools of memory and shares memory with the container.
> This could be accomplished by having the host share a huge page file with
> the container.
> 2) The container creates virtio rings on the shared memory.
> 3) The container creates mbuf memory pools on the shared memory.
> 4) The container sends the memory and vring information to vhost through
> vhost messages. This could be done either through an ioctl call or a
> vhost-user message.
>
> How is the vhost message sent?
> A: There are two alternative ways to do this.
> 1) The customized virtio PMD is responsible for all the vring creation
> and vhost message sending.
>>> Above is our approach so far.
>>> It seems Yanping also takes this kind of approach.
>>> We are using vhost-user functionality instead of using the vhost-net
>>> kernel module.
>>> Probably this is the difference between Yanping and us.
>> In my current implementation, the device simulation layer talks to "user
>> space" vhost through the cuse interface. It could also be done through the
>> vhost-user socket. This isn't the key point.
>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
>> accurate, covering either cuse or a unix domain socket. :)
>>
>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>> they are also trying to connect to "user space" vhost. Correct me if I am wrong.
>> Yes, there is some difference between these two. The vhost-net kernel module
>> can directly access another process's memory, while with
>> vhost-user (cuse/socket) we need to do the memory mapping.
>>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>>> This PMD is implemented on top of librte_vhost.
>>> It allows a DPDK application to handle a vhost-user (or cuse) backend as a
>>> normal NIC port.
>>> This PMD should work with both Xie's and Yanping's approaches.
>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>
> 2) We could do this through a lightweight device simulation framework.
> The device simulation creates a simple PCI bus. On the PCI bus,
> virtio-net PCI devices are created. The device simulation provides an
> IOAPI for MMIO/IO access.
>>> Does it mean you implemented a kernel module?
>>> If so, do you still need the vhost-cuse functionality to handle vhost
>>> messages in userspace?
>> The device simulation is a library running in user space in the container.
>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>> devices.
>> The virtio-container-PMD configures the virtio-net pseudo devices
>> through the IOAPI provided by the device simulation rather than through IO
>> instructions as in a KVM guest.
>> Why do we use device simulation?
>> Because we could create other virtio devices in the container, and it
>> provides a common way to talk to the vhost-xx modules.
> Thanks for the explanation.
> At first reading, I thought the difference between approach 1 and
> approach 2 was whether we need to implement a new kernel module or not.
> But now I understand how you implemented it.
>
> Please let me explain our design in more detail.
> We might use a somewhat similar approach to handle a pseudo virtio-net
> device in DPDK.
> (Anyway, we haven't finished implementing it yet, so this overview might
> have some technical problems.)
>
> Step 1. Separate the virtio-net and vhost-user socket related code from
> QEMU, then implement it as a separate program.
> The program also has the features below.
>  - Create a directory that contains almost the same files as
> /sys/bus/pci/device//*
>(To scan these files located outside sysfs, we need to fix the EAL)
>  - This dummy device is driven by dummy-virtio-net-driver. This name is
> specified by the '/driver' file.
>  - Create a 

[dpdk-dev] vhost compliant virtio based networking interface in container

2015-08-26 Thread Tetsuya Mukawa
On 2015/08/25 18:56, Xie, Huawei wrote:
> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>> Hi Xie and Yanping,
>>
>>
>> May I ask you some questions?
>> It seems we are also developing almost the same thing.
> Good to know that we are tackling the same problem and have a similar
> idea.
> What is your status now? We had the POC running, and it is compliant with
> DPDK vhost.
> Interrupt-like notification isn't supported.

We implemented the vhost PMD first, so we have only just started implementing it.

>
>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>> Added dev at dpdk.org
>>>
>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
 Yanping:
 I read your mail; it seems what we did is quite similar. Here I wrote a
 quick mail to describe our design. Let me know if it is the same thing.

 Problem Statement:
 We don't have a high-performance networking interface in containers for
 NFV. The current veth-pair-based interface cannot easily be accelerated.

 The key components involved:
 1. DPDK-based virtio PMD driver in the container.
 2. Device simulation framework in the container.
 3. DPDK (or kernel) vhost running in the host.

 How is virtio created?
 A: There is no "real" virtio-pci device in the container environment.
 1) The host maintains pools of memory and shares memory with the container.
 This could be accomplished by having the host share a huge page file with
 the container.
 2) The container creates virtio rings on the shared memory.
 3) The container creates mbuf memory pools on the shared memory.
 4) The container sends the memory and vring information to vhost through
 vhost messages. This could be done either through an ioctl call or a
 vhost-user message.

 How is the vhost message sent?
 A: There are two alternative ways to do this.
 1) The customized virtio PMD is responsible for all the vring creation
 and vhost message sending.
>> Above is our approach so far.
>> It seems Yanping also takes this kind of approach.
>> We are using vhost-user functionality instead of using the vhost-net
>> kernel module.
>> Probably this is the difference between Yanping and us.
> In my current implementation, the device simulation layer talks to "user
> space" vhost through the cuse interface. It could also be done through the
> vhost-user socket. This isn't the key point.
> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
> accurate, covering either cuse or a unix domain socket. :)
>
> As for Yanping, they are now connecting to the vhost-net kernel module, but
> they are also trying to connect to "user space" vhost. Correct me if I am wrong.
> Yes, there is some difference between these two. The vhost-net kernel module
> can directly access another process's memory, while with
> vhost-user (cuse/socket) we need to do the memory mapping.
>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>> This PMD is implemented on top of librte_vhost.
>> It allows a DPDK application to handle a vhost-user (or cuse) backend as a
>> normal NIC port.
>> This PMD should work with both Xie's and Yanping's approaches.
>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>
 2) We could do this through a lightweight device simulation framework.
 The device simulation creates a simple PCI bus. On the PCI bus,
 virtio-net PCI devices are created. The device simulation provides an
 IOAPI for MMIO/IO access.
>> Does it mean you implemented a kernel module?
>> If so, do you still need the vhost-cuse functionality to handle vhost
>> messages in userspace?
> The device simulation is a library running in user space in the container.
> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
> devices.
> The virtio-container-PMD configures the virtio-net pseudo devices
> through the IOAPI provided by the device simulation rather than through IO
> instructions as in a KVM guest.
> Why do we use device simulation?
> Because we could create other virtio devices in the container, and it
> provides a common way to talk to the vhost-xx modules.

Thanks for the explanation.
At first reading, I thought the difference between approach 1 and
approach 2 was whether we need to implement a new kernel module or not.
But now I understand how you implemented it.

Please let me explain our design in more detail.
We might use a somewhat similar approach to handle a pseudo virtio-net
device in DPDK.
(Anyway, we haven't finished implementing it yet, so this overview might have
some technical problems.)

Step 1. Separate the virtio-net and vhost-user socket related code from QEMU,
then implement it as a separate program.
The program also has the features below; a rough sketch of the resulting
layout follows the list.
 - Create a directory that contains almost the same files as
/sys/bus/pci/device//*
   (To scan these files located outside sysfs, we need to fix the EAL)
 - This dummy device is driven by dummy-virtio-net-driver. This name is
specified by the '/driver' file.
 - Create a shared file that represents the PCI configuration space, then
mmap it, and also specify its path in '/resource_path'.
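
For illustration only, such a dummy device directory could be populated as in
the sketch below. The location, BDF, file names, and the BAR size are invented
for the example; the real layout would follow whatever the modified EAL scan
expects.

/* Illustrative only: create a fake sysfs-style device directory with a
 * 'driver' name file and an mmap-able 'resource0' file standing in for the
 * PCI configuration/BAR space of the pseudo virtio-net device. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEV_DIR  "/tmp/pseudo-pci/0000:00:01.0"   /* assumed location */
#define BAR_SIZE 4096

int main(void)
{
    char path[256];
    FILE *f;
    int fd;

    mkdir("/tmp/pseudo-pci", 0755);
    mkdir(DEV_DIR, 0755);

    /* 'driver' file naming the dummy driver, as described above. */
    snprintf(path, sizeof(path), "%s/driver", DEV_DIR);
    f = fopen(path, "w");
    if (!f)
        return 1;
    fputs("dummy-virtio-net-driver\n", f);
    fclose(f);

    /* 'resource_path' points at the shared file that both this program
     * and the DPDK app in the container would mmap. */
    snprintf(path, sizeof(path), "%s/resource_path", DEV_DIR);
    f = fopen(path, "w");
    if (!f)
        return 1;
    fputs("/tmp/pseudo-pci/0000:00:01.0/resource0\n", f);
    fclose(f);

    /* The shared file standing in for the PCI resource. */
    snprintf(path, sizeof(path), "%s/resource0", DEV_DIR);
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, BAR_SIZE) != 0)
        return 1;
    close(fd);
    return 0;
}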

The program will be GPL, but it will be like a bridge on 

[dpdk-dev] vhost compliant virtio based networking interface in container

2015-08-25 Thread Tetsuya Mukawa
Hi Xie and Yanping,


May I ask you some questions?
It seems we are also developing almost the same thing.

On 2015/08/20 19:14, Xie, Huawei wrote:
> Added dev at dpdk.org
>
> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>> Yanping:
>> I read your mail; it seems what we did is quite similar. Here I wrote a
>> quick mail to describe our design. Let me know if it is the same thing.
>>
>> Problem Statement:
>> We don't have a high-performance networking interface in containers for
>> NFV. The current veth-pair-based interface cannot easily be accelerated.
>>
>> The key components involved:
>> 1. DPDK-based virtio PMD driver in the container.
>> 2. Device simulation framework in the container.
>> 3. DPDK (or kernel) vhost running in the host.
>>
>> How is virtio created?
>> A: There is no "real" virtio-pci device in the container environment.
>> 1) The host maintains pools of memory and shares memory with the container.
>> This could be accomplished by having the host share a huge page file with
>> the container.
>> 2) The container creates virtio rings on the shared memory.
>> 3) The container creates mbuf memory pools on the shared memory.
>> 4) The container sends the memory and vring information to vhost through
>> vhost messages. This could be done either through an ioctl call or a
>> vhost-user message.
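
As an illustration of steps 1)-3) above, the container-side setup could look
roughly like the sketch below. The file path, the sizes, and the hand-rolled
ring layout are assumptions made for the example; a real implementation would
reuse the DPDK virtio ring definitions and mempool APIs instead.

/* Minimal sketch of steps 1)-3): the container maps a huge page file that
 * the host shared with it, then lays out a virtio ring and an mbuf area
 * inside that shared region.  All names and sizes are illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_FILE "/dev/hugepages/container_vhost_shm"  /* assumed path */
#define SHARED_SIZE (64UL * 1024 * 1024)
#define VRING_NUM   256

int main(void)
{
    int fd = open(SHARED_FILE, O_RDWR);
    if (fd < 0) {
        perror("open shared huge page file");
        return 1;
    }

    /* Step 1): map the memory region the host shared with the container. */
    uint8_t *base = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(base, 0, SHARED_SIZE);

    /* Step 2): place a legacy virtio ring at the start of the region:
     * descriptor table, avail ring, then the used ring on the next page
     * boundary. */
    void *desc  = base;
    void *avail = base + VRING_NUM * 16;
    void *used  = base + ((VRING_NUM * 16 + 6 + 2 * VRING_NUM + 4095)
                          & ~4095UL);

    /* Step 3): the rest of the region would back the mbuf mempool, so that
     * every buffer address handed to vhost lives in shared memory. */
    void *mbuf_area = base + 2 * 1024 * 1024;

    printf("ring at %p/%p/%p, mbuf area at %p\n", desc, avail, used,
           mbuf_area);
    munmap(base, SHARED_SIZE);
    close(fd);
    return 0;
}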
>>
>> How is the vhost message sent?
>> A: There are two alternative ways to do this.
>> 1) The customized virtio PMD is responsible for all the vring creation
>> and vhost message sending.

Above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using vhost-user functionality instead of using the vhost-net
kernel module.
Probably this is the difference between Yanping and us.

BTW, we are going to submit a vhost PMD for DPDK 2.2.
This PMD is implemented on top of librte_vhost.
It allows a DPDK application to handle a vhost-user (or cuse) backend as a
normal NIC port.
This PMD should work with both Xie's and Yanping's approaches.
(In the case of Yanping's approach, we may need vhost-cuse.)

>> 2) We could do this through a lightweight device simulation framework.
>> The device simulation creates a simple PCI bus. On the PCI bus,
>> virtio-net PCI devices are created. The device simulation provides an
>> IOAPI for MMIO/IO access.

Does it mean you implemented a kernel module?
If so, do you still need the vhost-cuse functionality to handle vhost
messages in userspace?

>> 2.1  The virtio PMD configures the pseudo virtio device just as it does in
>> a KVM guest environment.
>> 2.2  Rather than using IO instructions, the virtio PMD uses the IOAPI for IO
>> operations on the virtio-net PCI device.
>> 2.3  The device simulation is responsible for the device state machine
>> simulation.
>> 2.4  The device simulation is responsible for talking to vhost.
>> With this approach, we could minimize the virtio PMD modifications.
>> The virtio PMD is like configuring a real virtio-net PCI device.
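
To make 2.1/2.2 a bit more concrete, the IOAPI could simply be a small table
of callbacks that the virtio PMD invokes where it would normally issue port
IO. Everything below (the names, the opaque device handle) is hypothetical,
not the actual interface of the framework being described.

/* Hypothetical sketch of the IOAPI mentioned in 2.2: the virtio PMD calls
 * these hooks where it would normally use inb()/outb(), and the device
 * simulation library implements them against its software virtio-net
 * device model. */
#include <stdint.h>

struct sim_pci_dev;  /* opaque handle owned by the device simulation */

struct sim_io_ops {
    uint8_t  (*io_read8)(struct sim_pci_dev *dev, uint64_t off);
    uint16_t (*io_read16)(struct sim_pci_dev *dev, uint64_t off);
    uint32_t (*io_read32)(struct sim_pci_dev *dev, uint64_t off);
    void (*io_write8)(struct sim_pci_dev *dev, uint64_t off, uint8_t val);
    void (*io_write16)(struct sim_pci_dev *dev, uint64_t off, uint16_t val);
    void (*io_write32)(struct sim_pci_dev *dev, uint64_t off, uint32_t val);
};

/* Per 2.3/2.4: register writes drive the simulated device state machine,
 * and once the driver sets DRIVER_OK the simulation can send the
 * corresponding vhost messages on the PMD's behalf.  Offset 18 is the
 * device status register in the legacy virtio PCI IO layout. */
static inline void
sim_set_status(const struct sim_io_ops *ops, struct sim_pci_dev *dev,
               uint8_t status)
{
    ops->io_write8(dev, 18, status);
}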
>>
>> Memory mapping?
>> A: QEMU can access the whole guest memory in a KVM environment. We need
>> to fill that gap.
>> The container maps the shared memory into the container's virtual address
>> space, and the host maps it into the host's virtual address space. There is
>> a fixed-offset mapping between the two.
>> The container creates the shared vrings on this memory, and it also
>> creates the mbuf memory pool on the shared memory.
>> In the VHOST_SET_MEMORY_TABLE message, we send the memory mapping
>> information for the shared memory. Since we require the mbuf pools to be
>> created on the shared memory, and buffers are allocated from those pools,
>> DPDK vhost can translate the GPA in a vring descriptor to a host virtual
>> address.
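
For reference, the mapping information carried by the memory-table message
boils down to a list of regions like the sketch below: a "guest physical"
base, a size, and the sender's userspace address (in the vhost-user case the
backing fds are additionally passed over the unix socket). The struct and
helper names here are illustrative, not the actual message layout.

/* Sketch of one entry of the memory table sent by the container: it tells
 * vhost how to turn an address found in the vring into something it can
 * map and dereference itself. */
#include <stdint.h>

struct shm_region {
    uint64_t guest_phys_addr;  /* address space the vring descriptors use */
    uint64_t memory_size;      /* size of the shared region */
    uint64_t userspace_addr;   /* where the sender mapped the region */
    uint64_t mmap_offset;      /* offset into the shared file (vhost-user) */
};

/* With the CVA scheme described below, the "guest physical" base is simply
 * the container virtual address of the shared mapping, so vhost needs no
 * extra translation table. */
static struct shm_region
make_region(void *container_va, uint64_t size, uint64_t file_offset)
{
    struct shm_region r;

    r.guest_phys_addr = (uintptr_t)container_va;
    r.memory_size     = size;
    r.userspace_addr  = (uintptr_t)container_va;
    r.mmap_offset     = file_offset;
    return r;
}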
>>
>>
>> GPA or CVA in the vring descriptors?
>> To ease the memory translation, rather than using the GPA, here we use the
>> CVA (container virtual address). This is the tricky part.
>> 1) The virtio PMD writes the vring's VFN, rather than a PFN, to the PFN
>> register through the IOAPI.
>> 2) The device simulation framework uses the VFN as the PFN.
>> 3) The device simulation sends SET_VRING_ADDR with the CVA.
>> 4) The virtio PMD fills the vring descriptors with the CVA of the mbuf data
>> pointer rather than a GPA.
>> So when the host sees the CVA, it can translate it to an HVA (host virtual
>> address).
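
On the host side, resolving a descriptor address then reduces to the usual
region lookup plus a constant offset. A rough sketch follows (this is not the
actual librte_vhost code; the names are made up for the example):

/* Rough sketch of the host-side lookup when vring descriptors carry CVAs:
 * find the region that contains the address and apply the fixed delta
 * between where the container and the host mapped the same shared file. */
#include <stddef.h>
#include <stdint.h>

struct host_region {
    uint64_t cva_start;  /* container virtual address of the region */
    uint64_t size;
    uint64_t hva_start;  /* where the host mmap'ed the same file */
};

static uint64_t
cva_to_hva(const struct host_region *reg, size_t nr, uint64_t cva)
{
    for (size_t i = 0; i < nr; i++) {
        if (cva >= reg[i].cva_start &&
            cva < reg[i].cva_start + reg[i].size)
            return cva - reg[i].cva_start + reg[i].hva_start;
    }
    return 0;  /* not in any shared region: drop the descriptor */
}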
>>
>> Worth noting:
>> The virtio interface in the container follows the vhost message format and
>> is compliant with the DPDK vhost implementation, i.e., no DPDK vhost
>> modification is needed.
>> vhost isn't aware of whether the incoming virtio frontend comes from a KVM
>> guest or from a container.
>>
>> This pretty much covers the high-level design. There are quite a few
>> low-level issues. For example, a 32-bit PFN is enough for a KVM guest, but
>> since we use a 64-bit VFN (virtual page frame number), a trick is done here
>> through a special IOAPI.

In addition to the above, we might consider the "namespace" kernel functionality.
Technically it would not be a big problem, but it is related to security,
so it would be nice to take it into account.

Regards,
Tetsuya

>> /huawei
>>