From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Thu, 27 Apr 2006 15:51:26 +0400
> There are some caveats here found while developing zero-copy sniffer
> [1]. Project's goal was to remap skbs into userspace in real-time.
> While absolute numbers (posted to netdev@) were really high, it is only
> applicable to read-only application. As was shown in IOAT thread,
> data must be warmed in caches, so reading from mapped area will be as
> fast as memcpy() (read+write), and copy_to_user() actually almost equal
> to memcpy() (benchmarks were posted to netdev@). And we must add
> remapping overhead.

Yes, all of these issues are related quite strongly. Thanks for
making the connection explicit.

But, the mapping overhead is zero for this net channel stuff, at
least as it is implemented and designed by Kelly. The ring buffer is
set up ahead of time in the user's address space, and a ring of
buffers pointing into that area is given to the networking card.

We remember the translations here, so no get_user_pages() on each
transfer and garbage like that.

And yes this all harks back to the issues that are discussed in
Chapter 5 of Networking Algorithmics. But the core thing to
understand is that by defining a new API and setting up the buffer
pool ahead of time, we avoid all of the get_user_pages() overhead
while retaining full kernel/user protection.

Evgeniy, the difference between this and your work is that you did
not have an intelligent piece of hardware that could be told to
recognize flows, and only put packets for a specific flow into that
flow's buffer pool.

> If we want to dma data from nic into premapped userspace area, this will
> strike with message sizes/misalignment/slow read and so on, so
> preallocation has even more problems.

I do not really think this is an issue, we put the full packet into
user space and teach it where the offset is to the actual data.
We'll do the same things we do today to try and get the data area
aligned. User can do whatever is logical and relevant on his end
to deal with strange cases.
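As a rough illustration of the "full packet plus offset" scheme
described above, here is a minimal single-threaded userspace sketch
in C. Every name in it (chan_ring, chan_desc, chan_produce,
chan_peek, chan_release, RING_SLOTS, SLOT_SIZE) is hypothetical and
is not taken from Kelly's implementation. The memcpy() in the
producer only stands in for the card's DMA, and a ring really shared
between card, kernel, and user would need memory barriers and an
area mapped into the process ahead of time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8u     /* hypothetical slot count, must be a power of two */
#define SLOT_SIZE  2048u  /* hypothetical slot size, one full packet per slot */

/* Hypothetical descriptor: where the payload sits inside its slot. */
struct chan_desc {
    uint16_t data_off;  /* offset of payload from the start of the slot */
    uint16_t data_len;  /* payload length in bytes */
};

/* Hypothetical ring; in the design discussed here this whole area
 * would be set up once, ahead of time, in the user's address space. */
struct chan_ring {
    uint32_t head;  /* next slot the producer fills */
    uint32_t tail;  /* next slot the consumer reads */
    struct chan_desc desc[RING_SLOTS];
    uint8_t slots[RING_SLOTS][SLOT_SIZE];
};

/* Producer side (stand-in for the card): store the full packet,
 * headers included, and record where the payload starts. */
static int chan_produce(struct chan_ring *r, const void *pkt,
                        uint16_t pkt_len, uint16_t hdr_len)
{
    if (r->head - r->tail == RING_SLOTS)
        return -1;  /* ring full */
    if (pkt_len > SLOT_SIZE || hdr_len > pkt_len)
        return -1;  /* packet does not fit or header length is bogus */

    uint32_t i = r->head & (RING_SLOTS - 1);
    memcpy(r->slots[i], pkt, pkt_len);  /* stands in for DMA */
    r->desc[i].data_off = hdr_len;
    r->desc[i].data_len = (uint16_t)(pkt_len - hdr_len);
    r->head++;
    return 0;
}

/* Consumer side: return a pointer to the payload in place, no copy.
 * Returns NULL when the ring is empty. */
static const uint8_t *chan_peek(const struct chan_ring *r, uint16_t *len)
{
    if (r->head == r->tail)
        return NULL;

    uint32_t i = r->tail & (RING_SLOTS - 1);
    *len = r->desc[i].data_len;
    return r->slots[i] + r->desc[i].data_off;
}

/* Consumer side: hand the oldest slot back to the producer. */
static void chan_release(struct chan_ring *r)
{
    if (r->head != r->tail)
        r->tail++;
}
```

A reader would call chan_peek() to get at the payload in place, use
it, and then chan_release() the slot. Nothing is copied on the read
side, and the address of the area never changes, so there is nothing
to translate per transfer.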
In fact we can specify that card has to take some care to get
data area of packet aligned on say an 8 byte boundary or something
like that. When we don't have hardware assist, we are going to be
doing copies.

> This change also requires significant changes in application, at least
> until recv/send are changed, which is not the best thing to do.

This is exactly the point, we can only do a good job and receive zero
copy if we can change the interfaces, and that's exactly what we're
doing here.

> I do think that significant win in VJ's tests belongs not to remapping
> and cache-oriented changes, but to move all protocol processing into
> process' context.

I partly disagree. The biggest win is eliminating all of the control
overhead (all of "softint RX + protocol demux + IP route lookup +
socket lookup" is turned into single flow demux), and the SMP safe
data structure which makes it realistic enough to always move the bulk
of the packet work to the socket's home cpu.

I do not think userspace protocol implementation buys enough to
justify it. We have to do the protection switch in and out of kernel
space anyways, so why not still do the protected protocol processing
work in the kernel? It is still being done on the user's behalf,
contributes to his time slice, and avoids all of the terrible issues
of userspace protocol implementations.

So in my mind, the optimal situation from both a protection
preservation and also a performance perspective is net channels to
kernel socket protocol processing, buffers DMA'd directly into
userspace if hardware assist is present.

> I fully agree with Dave that it must be implemented step-by-step, and
> the most significant, IMHO, is moving protocol processing into socket's
> "place". This will force to netfilter changes, but I do think that for
> the proof-of-concept code we can turn it off.
And I also want to note that even if the whole idea explodes and
cannot be made to work, there are good arguments for transitioning
to SKB'less drivers for their own sake. So work will really not be
lost.

Let's have 100 different implementations of net channels! :-)