On Monday, May 21, 2018 1:07 PM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote:
> On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>> <willemdebruijn.ker...@gmail.com> wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote:
>>>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>>>> ring structure to prevent other kernel threads from reusing the same
>>>>> entry before it's fully filled in, passed to user space, and then
>>>>> eventually passed back to the kernel for use with a new packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen <jro...@cisco.com>
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2. This bug
>>>>> makes it possible for multiple kernel threads to claim ownership of the
>>>>> same ring entry, corrupting the ring and the corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution; the previous proposal was
>>>>> described in
>>>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html,
>>>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>>>> to prevent RX ring overrun.
>>>>>
>>>>> Those diffs would have changed the binary interface and broken certain
>>>>> applications. Consensus was that such a change would be inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space for tracking the
>>>>> intermediate state of an entry and prevent more than one kernel thread
>>>>> from simultaneously allocating a ring entry.
>>>>> This avoids any impact to the binary interface between kernel and
>>>>> userspace but comes at the additional cost of requiring a second
>>>>> spin_lock when passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>>>>     overrun
>>>>>
>>>>>  net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  net/packet/internal.h  | 14 +++++++++++
>>>>>  2 files changed, 78 insertions(+)
>>>>>
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>>>  #endif
>>>>>
>>>>>         if (po->tp_version <= TPACKET_V2) {
>>>>> +               spin_lock(&sk->sk_receive_queue.lock);
>>>>>                 __packet_set_status(po, h.raw, status);
>>>>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +               spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>>                 sk->sk_data_ready(sk);
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last time
>>>> around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be very
>>>> expensive. Once we need to accept that, we could opt for a simpler
>>>> implementation akin to the one discussed in the previous thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not atomically
>>>> be cleared together with __packet_set_status. This is no longer
>>>> an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>>>> clear it after flipping status (userspace should treat it as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined, but due to
>>>> TPACKET_HDRLEN alignment it exists on both 32-bit and 64-bit
>>>> platforms.
>>>>
>>>> The danger with using padding is that a process may write to it
>>>> and cause deadlock, of course. There is no logical reason for doing
>>>> so.
>>>
>>> For the ring, there is no requirement to allocate exactly the amount
>>> specified by the user request. Safer than relying on shared memory
>>> and simpler than the extra allocation in this patch would be to allocate
>>> extra shadow memory at the end of the ring (and not mmap that).
>>>
>>> That still leaves an extra cold cacheline vs using tp_padding.
>>
>> Given my lack of experience and knowledge in writing kernel code
>> it was easier for me to allocate the shadow ring as a separate
>> structure. Of course it's not about me and my skills, so if it's
>> more appropriate to allocate at the tail of the existing ring
>> then certainly I can look at doing that.
>>
>> I think the bigger issues, as you've pointed out, are the cost of
>> the additional spin lock and whether the additional state should be
>> stored in-band (fewer cache lines) or out-of-band (less risk of
>> breaking due to unpredictable application behavior).
>
> We don't need the spinlock if clearing the shadow byte after
> setting the status to user.
>
> Worst case, user will set it back to kernel while the shadow
> byte is not cleared yet and the next producer will drop a packet.
> But next producers will make progress, so there is no deadlock
> or corruption.
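[For concreteness, the shadow-byte ownership scheme being debated above can be modeled in plain C. All names and types here are illustrative stand-ins, not the actual af_packet structures; in the patch itself both the claim and the release run under sk->sk_receive_queue.lock.]

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8
#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   1

static uint8_t slot_status[RING_SIZE]; /* stands in for tp_status, shared with userspace */
static uint8_t shadow[RING_SIZE];      /* kernel-only in-use marker */
static unsigned int head;              /* next slot to allocate */

/* Claim the head slot (in the patch, called with the receive-queue
 * spinlock held).  A slot is free only if userspace has returned it
 * (status) AND no other kernel thread is still filling it (shadow).
 * Returns the slot index, or -1 to drop the packet. */
static int claim_slot(void)
{
	unsigned int h = head % RING_SIZE;

	if (slot_status[h] != TP_STATUS_KERNEL || shadow[h])
		return -1;	/* ring full or slot still in flight */
	shadow[h] = 1;
	head++;
	return (int)h;
}

/* Release a filled slot to userspace.  The patch re-takes the queue
 * lock around these two stores so the status flip and the shadow
 * clear appear atomic to concurrent producers. */
static void release_slot(unsigned int slot)
{
	slot_status[slot] = TP_STATUS_USER;
	shadow[slot] = 0;
}
```

The shadow check is what closes the window: between claim and release, the slot shows tp_status == TP_STATUS_KERNEL to userspace, so only the kernel-side marker distinguishes "free" from "being filled".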
I thought so too for a while, but after spending more time than I care
to admit I realized the following sequence was occurring:

    Core A                          Core B
    ------                          ------
  - Enter spin_lock
  - Get tp_status of head (X)
    tp_status == 0
  - Check inuse
    inuse == 0
  - Allocate entry X
    advance head (X+1)
    set inuse=1
  - Exit spin_lock

    <very long delay>               <allocate N-1 entries
                                     where N = size of ring>

                                  - Enter spin_lock
                                  - get tp_status of head (X+N)
                                    tp_status == 0 (but slot
                                    in use for X on core A)

  - write tp_status of   <--- trouble!
    X = TP_STATUS_USER   <--- trouble!
  - write inuse=0        <--- trouble!

                                  - Check inuse
                                    inuse == 0
                                  - Allocate entry X+N
                                    advance head (X+N+1)
                                    set inuse=1
                                  - Exit spin_lock

At this point core A has just passed slot X to userspace with a packet
and core B has just been assigned slot X+N (the same slot as X) for its
new packet. Both cores A and B end up filling in that slot.

Tracking this down was one of the reasons it took me a while to produce
these updated diffs.

>
> It probably does require a shadow structure as opposed to a
> padding byte to work with the long tail of (possibly broken)
> applications, sadly.

I agree.

>
> A setsockopt for userspace to signal a stricter interpretation of
> tp_status to elide the shadow hack could then be considered.
> It's not pretty. Either way, no full new version is required.
>
>> As much as I would like to find a solution that doesn't require
>> the spin lock, I have yet to do so. Maybe the answer is that
>> existing applications will need to suffer the performance impact,
>> but a new version or option for TPACKET_V1/V2 could be added to
>> indicate strict adherence to the TP_STATUS_USER bit, and then the
>> original diffs could be used.
>>
>> There is another option I was considering but have yet to try
>> which would avoid needing a shadow ring by using counter(s) to
>> track the maximum sequence number queued to userspace vs. the next
>> sequence number to be allocated in the ring.
>> If the difference
>> is greater than the size of the ring then the ring can be
>> considered full and the allocation would fail. Of course this may
>> create an additional hotspot between cores; I'm not sure whether
>> that would be significant.
>
> Please do have a look, but I don't think that this will work in this
> case in practice. It requires tracking the producer tail. Updating
> the slowest writer requires probing each subsequent slot's status
> byte to find the new tail, which is a lot of (by then cold) cacheline
> reads.

I've thought about it a little more and am not convinced it's workable,
but I'll spend a little more time on it before giving up.
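[The counter test floated above might look like the sketch below. The helper name and signature are hypothetical, not from any patch in this thread, and this shows only the fullness check, not the (contested) problem of keeping max_user_seq up to date.]

```c
#include <assert.h>
#include <limits.h>

/* Treat the ring as full when the next sequence number to allocate has
 * run more than a full ring ahead of the highest sequence number
 * queued to userspace. */
static int ring_full(unsigned long next_alloc_seq,
		     unsigned long max_user_seq,
		     unsigned long ring_size)
{
	/* unsigned subtraction keeps the distance correct even after
	 * the sequence counters wrap around */
	return next_alloc_seq - max_user_seq > ring_size;
}
```

As noted in the reply above, the hard part is not this comparison but advancing max_user_seq: finding the new slowest outstanding writer means probing subsequent slots' status bytes, which touches cold cachelines.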