On 9/8/2015 12:45 PM, Tetsuya Mukawa wrote:
> On 2015/09/07 14:54, Xie, Huawei wrote:
>> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>>> Hi Xie and Yanping,
>>>>>
>>>>>
>>>>> May I ask you some questions?
>>>>> It seems we are also developing almost the same thing.
>>>> Good to know that we are tackling the same problem and have a similar
>>>> idea.
>>>> What is your status now? We had the POC running, and compliant with
>>>> dpdkvhost.
>>>> Interrupt-like notification isn't supported.
>>> We implemented the vhost PMD first, so we have just started
>>> implementing it.
>>>
>>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>>> Added dev at dpdk.org
>>>>>>
>>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>>> Yanping:
>>>>>>> I read your mail; it seems what we did is quite similar. Here I
>>>>>>> wrote a quick mail to describe our design. Let me know if it is
>>>>>>> the same thing.
>>>>>>>
>>>>>>> Problem Statement:
>>>>>>> We don't have a high performance networking interface in the
>>>>>>> container for NFV. The current veth-pair-based interface can't be
>>>>>>> easily accelerated.
>>>>>>>
>>>>>>> The key components involved:
>>>>>>> 1. DPDK-based virtio PMD driver in the container.
>>>>>>> 2. Device simulation framework in the container.
>>>>>>> 3. DPDK (or kernel) vhost running on the host.
>>>>>>>
>>>>>>> How is virtio created?
>>>>>>> A: There is no "real" virtio-pci device in the container environment.
>>>>>>> 1) The host maintains pools of memory and shares memory with the
>>>>>>> container. This could be accomplished by the host sharing a
>>>>>>> hugepage file with the container.
>>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>>> 4) The container sends the memory and vring information to vhost
>>>>>>> through vhost messages. This could be done either through an ioctl
>>>>>>> call or a vhost-user message.
>>>>>>>
>>>>>>> How is the vhost message sent?
>>>>>>> A: There are two alternative ways to do this.
>>>>>>> 1) The customized virtio PMD is responsible for all the vring
>>>>>>> creation and vhost message sending.
>>>>> Above is our approach so far.
>>>>> It seems Yanping also takes this kind of approach.
>>>>> We are using the vhost-user functionality instead of the vhost-net
>>>>> kernel module.
>>>>> Probably this is the difference between Yanping and us.
>>>> In my current implementation, the device simulation layer talks to
>>>> "user space" vhost through the cuse interface. It could also be done
>>>> through a vhost-user socket. This isn't the key point.
>>>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is
>>>> more accurate, covering either cuse or a unix domain socket. :)
>>>>
>>>> As for Yanping, they are now connecting to the vhost-net kernel
>>>> module, but they are also trying to connect to "user space" vhost.
>>>> Correct me if wrong.
>>>> Yes, there is some difference between these two. The vhost-net kernel
>>>> module can directly access another process's memory, while with user
>>>> space vhost (cuse or socket) we need to do the memory mapping.
>>>>> BTW, we are going to submit a vhost PMD for DPDK-2.2.
>>>>> This PMD is implemented on librte_vhost.
>>>>> It allows a DPDK application to handle a vhost-user (or cuse)
>>>>> backend as a normal NIC port.
>>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>>
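
To make the memory/vring flow quoted above more concrete, here is a rough
sketch of what the container side could do: mmap the hugepage file shared
by the host, lay out a vring on it, and pass the region information plus
the fd to the vhost backend over a unix socket. The struct and function
names are only illustrative, not our actual code, and the message layout
is simplified, not the real vhost-user wire format.

/*
 * Sketch only: share a hugepage-backed region and its vring layout
 * with the vhost backend over a unix domain socket.
 */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

struct mem_region_msg {          /* illustrative, simplified layout */
    uint64_t region_addr;        /* address of the region in the PMD */
    uint64_t region_size;        /* length of the shared region */
    uint64_t vring_offset;       /* where the vring starts inside it */
};

int share_vring_with_vhost(const char *huge_path, const char *sock_path,
                           size_t region_size, uint64_t vring_offset)
{
    /* 1) Map the hugepage file that the host shared with the container. */
    int memfd = open(huge_path, O_RDWR);
    if (memfd < 0)
        return -1;
    void *base = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, memfd, 0);
    if (base == MAP_FAILED) {
        close(memfd);
        return -1;
    }

    /* 2)/3) The PMD would lay out desc/avail/used rings and mbuf pools
     * inside this region before telling the backend about it. */

    /* 4) Send the region info, plus the fd, to the vhost backend. */
    struct mem_region_msg msg = {
        .region_addr  = (uint64_t)(uintptr_t)base,
        .region_size  = region_size,
        .vring_offset = vring_offset,
    };
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
    if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    union {                                 /* fd passed via SCM_RIGHTS */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct iovec iov = { .iov_base = &msg, .iov_len = sizeof(msg) };
    struct msghdr mh = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&mh);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &memfd, sizeof(int));
    if (sendmsg(sock, &mh, 0) < 0)
        goto fail;

    close(sock);
    return 0;
fail:
    if (sock >= 0)
        close(sock);
    munmap(base, region_size);
    close(memfd);
    return -1;
}

Passing the fd via SCM_RIGHTS is what lets the backend mmap the same
region and do the address translation on its side.
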
>>>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>>>> The device simulation creates a simple PCI bus. On the PCI bus,
>>>>>>> virtio-net PCI devices are created. The device simulation provides
>>>>>>> an IOAPI for MMIO/IO access.
>>>>> Does it mean you implemented a kernel module?
>>>>> If so, do you still need the vhost-cuse functionality to handle vhost
>>>>> messages in userspace?
>>>> The device simulation is a library running in user space in the
>>>> container. It is linked with the DPDK app. It creates pseudo buses and
>>>> virtio-net PCI devices.
>>>> The virtio-container PMD configures the virtio-net pseudo devices
>>>> through the IOAPI provided by the device simulation rather than
>>>> through IO instructions as in KVM.
>>>> Why do we use device simulation?
>>>> We could create other virtio devices in the container, and provide a
>>>> common way to talk to the vhost-xx module.
>>> Thanks for the explanation.
>>> At first reading, I thought the difference between approach 1 and
>>> approach 2 was whether we need to implement a new kernel module or not.
>>> But now I understand how you implemented it.
>>>
>>> Please let me explain our design more.
>>> We might use a somewhat similar approach to handle a pseudo virtio-net
>>> device in DPDK.
>>> (Anyway, we haven't finished implementing it yet, so this overview
>>> might have some technical problems.)
>>>
>>> Step1. Separate the virtio-net and vhost-user socket related code from
>>> QEMU, then implement it as a separate program.
>>> The program also has the features below.
>>> - Create a directory that contains almost the same files as
>>> /sys/bus/pci/devices/<pci address>/*
>>> (To scan these files located outside sysfs, we need to fix EAL.)
>>> - This dummy device is driven by dummy-virtio-net-driver. This name is
>>> specified in the '<pci addr>/driver' file.
>>> - Create a shared file that represents the PCI configuration space,
>>> then mmap it, and also specify its path in '<pci addr>/resource_path'.
>>>
>>> The program will be GPL, but it will work like a bridge over the shared
>>> memory between the virtio-net PMD and the DPDK vhost backend.
>>> Actually, it will work under the virtio-net PMD, but we don't need to
>>> link against it.
>>> So I guess we don't have a GPL license issue.
>>>
>>> Step2. Fix the PCI scan code of EAL to scan dummy devices.
>>> - To scan the above files, extend pci_scan() of EAL.
>>>
>>> Step3. Add a new kdrv type to EAL.
>>> - To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
>>>
>>> Step4. Implement pci_dummy_virtio_net_map/unmap().
>>> - It will have almost the same functionality as pci_uio_map(), but for
>>> the dummy virtio-net device.
>>> - The dummy device will be mmapped using the path specified in '<pci
>>> addr>/resource_path'.
>>>
>>> Step5. Add a new compile option for the virtio-net device to replace
>>> the IO functions.
>>> - The IO functions of the virtio-net PMD will be replaced by read() and
>>> write() accesses to the shared memory.
>>> - Add a notification mechanism to the IO functions. This will be used
>>> when a write() to the shared memory is done.
>>> (Not sure exactly, but we probably need it.)
>>>
>>> Does it make sense?
>>> I guess Step1 and Step2 are different from your approach, but the rest
>>> might be similar.
>>>
>>> Actually, we just need sysfs entries for a virtio-net dummy device, but
>>> so far I don't have a good way to register them from user space without
>>> loading a kernel module.
>> Tetsuya:
>> I don't quite get the details. Who will create those sysfs entries? A
>> kernel module, right?
> Hi Xie,
>
> I don't create sysfs entries. I just create a directory that contains
> files that look like sysfs entries.
> And I initialize EAL with not only sysfs but also the above directory.
>
> In the last quoted sentence, I wanted to say we just need files that
> look like sysfs entries.
> But I don't know a good way to create files under sysfs without loading
> a kernel module.
> That is why I create the additional directory.
>
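
By the way, if I read Step4 right, the map function would do something
roughly like the sketch below: read the path from '<pci
addr>/resource_path' and mmap the shared file that stands in for the
device's configuration space. All names and sizes here are placeholders,
since nothing is implemented yet.

/* Sketch of Step4: map the dummy virtio-net device's "registers" from
 * the shared file named in '<pci addr>/resource_path'.  Placeholder
 * names; this is not existing EAL code. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DUMMY_BAR_SIZE 4096     /* size of the emulated config region */

void *pci_dummy_virtio_net_map(const char *dev_dir)
{
    char path[512];
    char res_path[512];

    /* Read the shared file's path from '<pci addr>/resource_path'. */
    snprintf(path, sizeof(path), "%s/resource_path", dev_dir);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return NULL;
    if (fgets(res_path, sizeof(res_path), f) == NULL) {
        fclose(f);
        return NULL;
    }
    fclose(f);
    res_path[strcspn(res_path, "\n")] = '\0';

    /* mmap the shared file; the pseudo virtio-net device process on the
     * other side watches this region and reacts to config writes. */
    int fd = open(res_path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *bar = mmap(NULL, DUMMY_BAR_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    return bar == MAP_FAILED ? NULL : bar;
}

The read()/write() replacements in Step5 would then just load and store
into this mapped region, plus whatever notification mechanism gets added.
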
>> The virtio-net device is configured through reads/writes to the shared
>> memory (between host and guest), right?
> Yes, I agree.
>
>> Where are the shared vring and shared memory created, on a hugepage
>> shared between host and guest?
> The virtqueues (vrings) are on the guest hugepage.
>
> Let me explain.
> The guest container should have read/write access to a part of the
> hugepage directory on the host.
> (For example, /mnt/huge/container1/ is shared between host and guest.)
> Also, host and guest need to communicate through a unix domain socket.
> (For example, host and guest can communicate using "/tmp/container1/sock".)
>
> If we can do the above, a virtio-net PMD on the guest can create
> virtqueues (vrings) on its hugepage, and write this information to a
> pseudo virtio-net device, which is a process created in the guest
> container.
> Then the pseudo virtio-net device sends it to the vhost-user backend
> (the host DPDK application) through a unix domain socket.
>
> So with my plan, there are three processes:
> the DPDK applications on the host and in the guest, plus a process that
> works like a virtio-net device.
>
>> Who will talk to dpdkvhost?
> If we need to talk to a cuse device or the vhost-net kernel module, the
> above pseudo virtio-net device could do the talking.
> (But, so far, my target is only vhost-user.)
>
>>> This is also why I need to change pci_scan().
>>>
>>> It seems you have implemented a virtio-net pseudo device under a BSD
>>> license.
>>> If so, it would be nice for this kind of PMD to use it.
>> Currently it is based on the native Linux kvm tool.
> Great, I hadn't noticed this option.
>
>>> In case it takes a lot of time to implement some missing functionality
>>> like interrupt mode, using the QEMU code might be one of the options.
>> For interrupt mode, I plan to use eventfd for sleep/wake; I have not
>> tried it yet.
>>> Anyway, we just need a good virtual NIC between containers and the host.
>>> So we don't insist on our approach and implementation.
>> Do you have comments on my implementation?
>> We could publish the version without the device framework first for
>> reference.
> No, I don't. Could you please share it?
> I am looking forward to seeing it.

OK, we are removing the device framework. Hope to publish it in one
month's time.
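
For the interrupt mode mentioned above, the eventfd idea is roughly the
following. This is an untested sketch with placeholder names (not
existing DPDK or vhost APIs); the fd would be handed to the other side
over the unix socket, like the memory fd.

/* Untested sketch: eventfd-based sleep/wake between the virtio PMD in
 * the container and the vhost backend on the host. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

int make_notify_fd(void)
{
    /* Counter starts at 0; read() returns the accumulated kick count. */
    return eventfd(0, 0);
}

/* Called by the notifying side, e.g. vhost after filling the used ring. */
void ring_kick(int efd)
{
    uint64_t one = 1;
    (void)write(efd, &one, sizeof(one));    /* wakes up any sleeper */
}

/* Called by the PMD when it decides to sleep instead of busy-polling;
 * blocks until the other side kicks the eventfd. */
void ring_wait(int efd)
{
    uint64_t cnt;
    (void)read(efd, &cnt, sizeof(cnt));
}
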
>
> Tetsuya
>