RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Wednesday, May 23, 2018 9:37 AM
> To: Jon Rosen (jrosen) <jro...@cisco.com>
> Cc: David S. Miller <da...@davemloft.net>; Willem de Bruijn
> <will...@google.com>; Eric Dumazet <eduma...@google.com>; Kees Cook
> <keesc...@chromium.org>; David Windsor <dwind...@gmail.com>; Rosen,
> Rami <rami.ro...@intel.com>; Reshetova, Elena
> <elena.reshet...@intel.com>; Mike Maloney <malo...@google.com>;
> Benjamin Poirier <bpoir...@suse.com>; Thomas Gleixner
> <t...@linutronix.de>; Greg Kroah-Hartman <gre...@linuxfoundation.org>;
> open list:NETWORKING [GENERAL] <net...@vger.kernel.org>; open list
> <linux-kernel@vger.kernel.org>
> Subject: Re: [PATCH v2] packet: track ring entry use using a shadow
> ring to prevent RX ring overrun
>
> On Wed, May 23, 2018 at 7:54 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
> >> > For the ring, there is no requirement to allocate exactly the amount
> >> > specified by the user request. Safer than relying on shared memory
> >> > and simpler than the extra allocation in this patch would be to
> >> > allocate extra shadow memory at the end of the ring (and not mmap
> >> > that).
> >> >
> >> > That still leaves an extra cold cacheline vs using tp_padding.
> >>
> >> Given my lack of experience and knowledge in writing kernel code
> >> it was easier for me to allocate the shadow ring as a separate
> >> structure. Of course it's not about me and my skills so if it's
> >> more appropriate to allocate at the tail of the existing ring
> >> then certainly I can look at doing that.
> >
> > The memory for the ring is not one contiguous block, it's an array of
> > blocks of pages (or 'order' sized blocks of pages). I don't think
> > increasing the size of each of the blocks to provide storage would be
> > such a good idea as it will risk spilling over into the next order
> > and wasting lots of memory. I suspect it's also more complex than a
> > single shadow ring to do both the allocation and the access.
> >
> > It could be tacked onto the end of the pg_vec[] used to store the
> > pointers to the blocks. The challenge with that is that a pg_vec[] is
> > created for each of RX and TX rings so either it would have to
> > allocate unnecessary storage for TX or the caller will have to say if
> > extra space should be allocated or not. E.g.:
> >
> >     static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order,
> >                                     int scratch, void **scratch_p)
> >
> > I'm not sure avoiding the extra allocation and moving it to the
> > pg_vec[] for the RX ring is going to get the simplification you were
> > hoping for. Is there another way of storing the shadow ring which
> > I should consider?
>
> I did indeed mean attaching extra pages to pg_vec[]. It should be
> simpler than a separate structure, but I may be wrong.

I don't think it would be too bad; it may actually turn out to be
convenient to implement.

> Either way, I still would prefer to avoid the shadow buffer completely.
> It incurs complexity and cycle cost on all users because of only the
> rare (non-existent?) consumer that overwrites the padding bytes.

I prefer that as well. I'm just not sure there is a bulletproof
solution without the shadow state. I also wish it were only a
theoretical issue, but unfortunately it is actually something our
customers have seen.

> Perhaps we can use padding yet avoid deadlock by writing a
> timed value. The simplest would be jiffies >> N. Then only a
> process that writes this exact value would be subject to drops and
> then still only for a limited period.
>
> Instead of depending on wall clock time, like jiffies, another option
> would be to keep a percpu array of values. Each cpu has a zero
> entry if it is not writing, nonzero if it is. If a writer encounters a
> number in padding that is > num_cpus, then the state is garbage
> from userspace. If <= num_cpus, it is adhered to only until that cpu
> clears its entry, which is guaranteed to happen eventually.
>
> Just a quick thought. This might not fly at all upon closer scrutiny.

I'm not sure I understand the suggestion, but I'll think on it some
more. Some other options maybe worth considering (in no specific
order):

- test the application to see if it will consume entries if tp_status
  is set to anything other than TP_STATUS_USER; only use the shadow
  ring if it doesn't strictly honor the TP_STATUS_USER bit.
- skip the shadow ring if we see the new TP_STATUS_USER_TO_KERNEL
  being used.
- use tp_len == -1 to indicate in use.
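Willem's percpu idea can be sketched in ordinary userspace C. Everything below (the function names, `NUM_CPUS`, and the claim encoding) is hypothetical and only models the validity check, not actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CPUS 8

/* 0 = slot not claimed; 1..NUM_CPUS = claimed by cpu (value - 1);
 * anything larger can only have been written by userspace, so the
 * kernel would be free to ignore it. */
static bool padding_claim_is_trusted(unsigned int padding_val)
{
    return padding_val != 0 && padding_val <= NUM_CPUS;
}

static unsigned int claim_value_for_cpu(unsigned int cpu)
{
    return cpu + 1;  /* never 0 and never > NUM_CPUS for a valid cpu */
}
```

The point of the encoding is that a hostile or buggy application can only delay a producer until the claiming cpu clears its entry, never deadlock it.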
RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
> > For the ring, there is no requirement to allocate exactly the amount
> > specified by the user request. Safer than relying on shared memory
> > and simpler than the extra allocation in this patch would be to
> > allocate extra shadow memory at the end of the ring (and not mmap
> > that).
> >
> > That still leaves an extra cold cacheline vs using tp_padding.
>
> Given my lack of experience and knowledge in writing kernel code
> it was easier for me to allocate the shadow ring as a separate
> structure. Of course it's not about me and my skills so if it's
> more appropriate to allocate at the tail of the existing ring
> then certainly I can look at doing that.

The memory for the ring is not one contiguous block, it's an array of
blocks of pages (or 'order' sized blocks of pages). I don't think
increasing the size of each of the blocks to provide storage would be
such a good idea as it will risk spilling over into the next order and
wasting lots of memory. I suspect it's also more complex than a single
shadow ring to do both the allocation and the access.

It could be tacked onto the end of the pg_vec[] used to store the
pointers to the blocks. The challenge with that is that a pg_vec[] is
created for each of RX and TX rings so either it would have to
allocate unnecessary storage for TX or the caller will have to say if
extra space should be allocated or not. E.g.:

    static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order,
                                    int scratch, void **scratch_p)

I'm not sure avoiding the extra allocation and moving it to the
pg_vec[] for the RX ring is going to get the simplification you were
hoping for. Is there another way of storing the shadow ring which
I should consider?
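As a rough userspace illustration of the signature floated above, a scratch-carrying allocator might look like the following. `alloc_pg_vec_sketch` and its layout are invented for illustration; the real alloc_pg_vec allocates page blocks per entry, not plain heap memory:

```c
#include <stdlib.h>

struct pgv { char *buffer; };

/* Hypothetical: allocate the block-pointer array plus optional scratch
 * space tacked onto the end of the pg_vec[] array itself.  A TX caller
 * would pass scratch == 0, an RX caller the shadow-ring size. */
static struct pgv *alloc_pg_vec_sketch(unsigned int block_nr,
                                       size_t scratch, void **scratch_p)
{
    struct pgv *pg_vec = calloc(1, block_nr * sizeof(*pg_vec) + scratch);

    if (!pg_vec)
        return NULL;
    /* Scratch (if any) lives immediately after the last pointer slot
     * and is simply never mmapped to userspace. */
    *scratch_p = scratch ? (void *)(pg_vec + block_nr) : NULL;
    return pg_vec;
}
```

This keeps a single allocation per ring while letting only the RX path pay for the extra shadow bytes.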
RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
> >>> I think the bigger issues as you've pointed out are the cost of
> >>> the additional spin lock and should the additional state be
> >>> stored in-band (fewer cache lines) or out-of-band (less risk of
> >>> breaking due to unpredictable application behavior).
> >>
> >> We don't need the spinlock if clearing the shadow byte after
> >> setting the status to user.
> >>
> >> Worst case, user will set it back to kernel while the shadow
> >> byte is not cleared yet and the next producer will drop a packet.
> >> But next producers will make progress, so there is no deadlock
> >> or corruption.
> >
> > I thought so too for a while, but after spending more time than I
> > care to admit I realized the following sequence was occurring:
> >
> >    Core A                          Core B
> >    ------                          ------
> >    - Enter spin_lock
> >    - Get tp_status of head (X)
> >      tp_status == 0
> >    - Check inuse
> >      inuse == 0
> >    - Allocate entry X
> >      advance head (X+1)
> >      set inuse=1
> >    - Exit spin_lock
> >
> >    <ring wraps around,
> >     where N = size of ring>
> >
> >                                    - Enter spin_lock
> >                                    - get tp_status of head (X+N)
> >                                      tp_status == 0 (but slot
> >                                      in use for X on core A)
> >
> >    - write tp_status of   <--- trouble!
> >      X = TP_STATUS_USER   <--- trouble!
> >    - write inuse=0        <--- trouble!
> >
> >                                    - Check inuse
> >                                      inuse == 0
> >                                    - Allocate entry X+N
> >                                      advance head (X+N+1)
> >                                      set inuse=1
> >                                    - Exit spin_lock
> >
> > At this point Core A just passed slot X to userspace with a
> > packet and Core B has just been assigned slot X+N (same slot as
> > X) for its new packet. Both cores A and B end up filling in that
> > slot. Tracking this down was one of the reasons it took me a
> > while to produce these updated diffs.
>
> Is this not just an ordering issue? Since inuse is set after tp_status,
> it has to be tested first (and barriers are needed to avoid reordering).

I changed the code as you suggest to do the inuse check first and
removed the extra added spin_lock/unlock, and it seems to be working.
I was able to run through the night without an issue (normally I would
hit the ring corruption in 1 to 2 hours). Thanks for pointing that
out; I should have caught that myself.

Next I'll look at your suggestion for where to put the shadow ring.
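The corrected ordering can be modeled in userspace with C11 atomics. This is a sketch with invented names; the kernel uses its own barrier primitives, and the real check also runs under the receive-queue spinlock:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   1

struct slot {
    atomic_uint status; /* models tp_status, shared with userspace */
    atomic_bool inuse;  /* models the kernel-private shadow byte   */
};

/* Producer: the shadow byte must be tested *before* tp_status, and
 * acquire loads pair with the release stores below so the two checks
 * cannot be reordered past the writes of another producer. */
static bool slot_try_claim(struct slot *s)
{
    if (atomic_load_explicit(&s->inuse, memory_order_acquire))
        return false;   /* still being filled by another cpu */
    if (atomic_load_explicit(&s->status, memory_order_acquire)
        != TP_STATUS_KERNEL)
        return false;   /* owned by userspace */
    atomic_store_explicit(&s->inuse, true, memory_order_release);
    return true;
}

/* Hand a filled slot to userspace: flip status first, clear the
 * shadow byte only afterwards. */
static void slot_release_to_user(struct slot *s)
{
    atomic_store_explicit(&s->status, TP_STATUS_USER, memory_order_release);
    atomic_store_explicit(&s->inuse, false, memory_order_release);
}
```

With this order the worst case is the benign one Willem describes: a producer sees a stale `inuse` and drops one packet, but never reuses a live slot.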
RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
On Monday, May 21, 2018 2:17 PM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
> On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:

...snip...

>> A setsockopt for userspace to signal a stricter interpretation of
>> tp_status to elide the shadow hack could then be considered.
>> It's not pretty. Either way, no full new version is required.
>>
>>> As much as I would like to find a solution that doesn't require
>>> the spin lock I have yet to do so. Maybe the answer is that
>>> existing applications will need to suffer the performance impact
>>> but a new version or option for TPACKET_V1/V2 could be added to
>>> indicate strict adherence of the TP_STATUS_USER bit and then the
>>> original diffs could be used.

It looks like adding new socket options is pretty rare, so I wonder if
a better option might be to define a new TP_STATUS_XXX bit which would
signal from a userspace application to the kernel that it strictly
interprets the TP_STATUS_USER bit to determine ownership.

Today's applications set tp_status = TP_STATUS_KERNEL (0) for the
kernel to pick up the entry. We could define a new value to pass
ownership as well as one to indicate to other kernel threads that an
entry is in use:

    #define TP_STATUS_USER_TO_KERNEL    (1 << 8)
    #define TP_STATUS_INUSE             (1 << 9)

If the kernel sees tp_status == TP_STATUS_KERNEL then it should use
the shadow method for tracking ownership. If it sees tp_status ==
TP_STATUS_USER_TO_KERNEL then it can use the TP_STATUS_INUSE method.

>>> There is another option I was considering but have yet to try
>>> which would avoid needing a shadow ring by using counter(s) to
>>> track the maximum sequence number queued to userspace vs. the next
>>> sequence number to be allocated in the ring. If the difference
>>> is greater than the size of the ring then the ring can be
>>> considered full and the allocation would fail. Of course this may
>>> create an additional hotspot between cores, not sure if that
>>> would be significant or not.
>>
>> Please do have a look, but I don't think that this will work in this
>> case in practice. It requires tracking the producer tail. Updating
>> the slowest writer requires probing each subsequent slot's status
>> byte to find the new tail, which is a lot of (by then cold) cacheline
>> reads.
>
> I've thought about it a little more and am not convinced it's
> workable but I'll spend a little more time on it before giving
> up.

I've given up on this method. Just don't see how to make it work.
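For anyone revisiting the counter idea, the fullness test it implies reduces to a wraparound-safe unsigned comparison along these lines. This is illustrative only; which counters to maintain, and at what cost, was exactly the open question in the thread:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the ring is full when the next sequence number to be
 * allocated has run a full ring ahead of the oldest entry still
 * outstanding with userspace.  Unsigned subtraction keeps the test
 * correct across 32-bit counter wraparound. */
static bool ring_full(uint32_t next_alloc_seq,
                      uint32_t oldest_outstanding_seq,
                      uint32_t ring_size)
{
    return (uint32_t)(next_alloc_seq - oldest_outstanding_seq) >= ring_size;
}
```

Willem's objection applies to maintaining `oldest_outstanding_seq` itself: advancing it requires probing cold status bytes to find the new tail.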
RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>> <willemdebruijn.ker...@gmail.com> wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote:
>>>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a
>>>>> shadow ring structure to prevent other kernel threads from reusing
>>>>> the same entry before it's fully filled in, passed to user space,
>>>>> and then eventually passed back to the kernel for use with a new
>>>>> packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen <jro...@cisco.com>
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it
>>>>> manages the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.
>>>>> This bug makes it possible for multiple kernel threads to claim
>>>>> ownership of the same ring entry, corrupting the ring and the
>>>>> corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution; the previous proposal
>>>>> was described in
>>>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html,
>>>>> subject [RFC PATCH] packet: mark ring entry as in-use inside
>>>>> spin_lock to prevent RX ring overrun.
>>>>>
>>>>> Those diffs would have changed the binary interface and have broken
>>>>> certain applications. Consensus was that such a change would be
>>>>> inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space for tracking
>>>>> intermediate state of an entry and prevent more than one kernel
>>>>> thread from simultaneously allocating a ring entry. This avoids any
>>>>> impact to the binary interface between kernel and userspace but
>>>>> comes at the additional cost of requiring a second spin_lock when
>>>>> passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>>   packet: track ring entry use using a shadow ring to prevent RX
>>>>>   ring overrun
>>>>>
>>>>>  net/packet/af_packet.c | 64 ++
>>>>>  net/packet/internal.h  | 14 +++
>>>>>  2 files changed, 78 insertions(+)
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>>>  #endif
>>>>>
>>>>>         if (po->tp_version <= TPACKET_V2) {
>>>>> +               spin_lock(&sk->sk_receive_queue.lock);
>>>>>                 __packet_set_status(po, h.raw, status);
>>>>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +               spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>>                 sk->sk_data_ready(sk);
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last
>>>> time around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be
>>>> very expensive. Once we need to accept that, we could opt for a
>>>> simpler implementation akin to the one discussed in the previous
>>>> thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not
>>>> atomically be cleared together with __packet_set_status. This is no
>>>> longer an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also
>>>> safe to clear it after flipping status (userspace should treat it
>>>> as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>>>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>>>> platforms.
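The hunk above pairs the user-visible status flip with the kernel-private shadow release under one lock. A minimal userspace model of that pairing, with hypothetical names and a pthread mutex standing in for the queue spinlock:

```c
#include <pthread.h>
#include <stdbool.h>

#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   1

/* Model of the patch's completion path: the status flip that userspace
 * can observe and the shadow-ring release are made to appear atomic to
 * other producers by doing both under the lock the allocation path
 * also takes. */
struct ring_model {
    pthread_mutex_t queue_lock;
    unsigned int status;  /* models tp_status of the completed entry */
    bool shadow_inuse;    /* models the shadow-ring ownership state  */
};

static void complete_rx_entry(struct ring_model *r)
{
    pthread_mutex_lock(&r->queue_lock);
    r->status = TP_STATUS_USER;   /* models __packet_set_status()        */
    r->shadow_inuse = false;      /* models packet_rx_shadow_release()   */
    pthread_mutex_unlock(&r->queue_lock);
}
```

It is exactly this second lock acquisition per received packet that Willem objects to below.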
RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote: > On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn > <willemdebruijn.ker...@gmail.com> wrote: >> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote: >>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which >>> casues the ring to get corrupted by allowing multiple kernel threads >>> to claim ownership of the same ring entry. Track ownership in a shadow >>> ring structure to prevent other kernel threads from reusing the same >>> entry before it's fully filled in, passed to user space, and then >>> eventually passed back to the kernel for use with a new packet. >>> >>> Signed-off-by: Jon Rosen <jro...@cisco.com> >>> --- >>> >>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages >>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2. This bug makes >>> it possible for multiple kernel threads to claim ownership of the same >>> ring entry, corrupting the ring and the corresponding packet(s). >>> >>> These diffs are the second proposed solution, previous proposal was >>> described >>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html >>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock >>> to prevent RX ring overrun >>> >>> Those diffs would have changed the binary interface and have broken certain >>> applications. Consensus was that such a change would be inappropriate. >>> >>> These new diffs use a shadow ring in kernel space for tracking intermediate >>> state of an entry and prevent more than one kernel thread from >>> simultaneously >>> allocating a ring entry. This avoids any impact to the binary interface >>> between kernel and userspace but comes at the additional cost of requiring a >>> second spin_lock when passing ownership of a ring entry to userspace. 
>>> >>> Jon Rosen (1): >>> packet: track ring entry use using a shadow ring to prevent RX ring >>> overrun >>> >>> net/packet/af_packet.c | 64 >>> ++ >>> net/packet/internal.h | 14 +++ >>> 2 files changed, 78 insertions(+) >>> >> >>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct >>> net_device *dev, >>> #endif >>> >>> if (po->tp_version <= TPACKET_V2) { >>> + spin_lock(>sk_receive_queue.lock); >>> __packet_set_status(po, h.raw, status); >>> + packet_rx_shadow_release(rx_shadow_ring_entry); >>> + spin_unlock(>sk_receive_queue.lock); >>> + >>> sk->sk_data_ready(sk); >> >> Thanks for continuing to look at this. I spent some time on it last time >> around but got stuck, too. >> >> This version takes an extra spinlock in the hot path. That will be very >> expensive. Once we need to accept that, we could opt for a simpler >> implementation akin to the one discussed in the previous thread: >> >> stash a value in tp_padding or similar while tp_status remains >> TP_STATUS_KERNEL to signal ownership to concurrent kernel >> threads. The issue previously was that that field could not atomically >> be cleared together with __packet_set_status. This is no longer >> an issue when holding the queue lock. >> >> With a field like tp_padding, unlike tp_len, it is arguably also safe to >> clear it after flipping status (userspace should treat it as undefined). >> >> With v1 tpacket_hdr, no explicit padding field is defined but due to >> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit >> platforms. >> >> The danger with using padding is that a process may write to it >> and cause deadlock, of course. There is no logical reason for doing >> so. > > For the ring, there is no requirement to allocate exactly the amount > specified by the user request. Safer than relying on shared memory > and simpler than the extra allocation in this patch would be to allocate > extra shadow memory at the end of the ring (and not mmap that). 
>
> That still leaves an extra cold cacheline vs using tp_padding.

Given my lack of experience and knowledge in writing kernel code
it was easier for me to allocate the shadow ring as a separate
structure. Of course it's not about me and my skills, so if it's
more appropriate to allocate at the tail of the existing ring
then certainly I can look at doing that.
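Willem's tp_padding suggestion above can be modeled in userspace. A minimal sketch, assuming an ownership marker stashed in a padding byte while tp_status stays TP_STATUS_KERNEL; the struct layout and names (`hdr_model`, `pad_owner`) are illustrative, not the real tpacket header:

```c
#include <assert.h>

#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   (1 << 0)

/* Stand-in for a ring frame header: tp_status is visible to userspace,
 * pad_owner models a tp_padding byte userspace treats as undefined. */
struct hdr_model {
	unsigned int tp_status;
	unsigned char pad_owner;
};

/* Claim under the queue lock: only succeeds if the frame is still
 * TP_STATUS_KERNEL and no other kernel thread has marked it. */
static int try_claim(struct hdr_model *h)
{
	if (h->tp_status != TP_STATUS_KERNEL || h->pad_owner)
		return 0;
	h->pad_owner = 1;
	return 1;
}

/* Hand the frame to userspace: flip status first, then clear the
 * marker (safe because userspace treats padding as undefined). */
static void publish(struct hdr_model *h)
{
	h->tp_status = TP_STATUS_USER;
	h->pad_owner = 0;
}
```

The point of the sketch is that a concurrent claim fails both while the frame is in progress (marker set) and after publication (status non-zero), closing the overrun window without changing the status values userspace sees.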
RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
Forward link to a new proposed patch at: https://www.mail-archive.com/netdev@vger.kernel.org/msg236629.html
[PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Track ownership in a shadow
ring structure to prevent other kernel threads from reusing the same
entry before it's fully filled in, passed to user space, and then
eventually passed back to the kernel for use with a new packet.

Signed-off-by: Jon Rosen <jro...@cisco.com>
---

There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2. This bug makes
it possible for multiple kernel threads to claim ownership of the same
ring entry, corrupting the ring and the corresponding packet(s).

These diffs are the second proposed solution; the previous proposal was
described in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
to prevent RX ring overrun

Those diffs would have changed the binary interface and broken certain
applications. Consensus was that such a change would be inappropriate.

These new diffs use a shadow ring in kernel space for tracking intermediate
state of an entry and prevent more than one kernel thread from simultaneously
allocating a ring entry. This avoids any impact to the binary interface
between kernel and userspace but comes at the additional cost of requiring
a second spin_lock when passing ownership of a ring entry to userspace.
Jon Rosen (1):
  packet: track ring entry use using a shadow ring to prevent RX ring
    overrun

 net/packet/af_packet.c | 64 ++
 net/packet/internal.h  | 14 +++
 2 files changed, 78 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e0f3f4a..4d08c8e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2165,6 +2165,26 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 	return 0;
 }
 
+static inline void *packet_rx_shadow_aquire_head(struct packet_sock *po)
+{
+	struct packet_ring_shadow_entry *entry;
+
+	entry = &po->rx_shadow.ring[po->rx_ring.head];
+	if (unlikely(entry->inuse))
+		return NULL;
+
+	entry->inuse = 1;
+	return (void *)entry;
+}
+
+static inline void packet_rx_shadow_release(void *_entry)
+{
+	struct packet_ring_shadow_entry *entry;
+
+	entry = (struct packet_ring_shadow_entry *)_entry;
+	entry->inuse = 0;
+}
+
 static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		       struct packet_type *pt, struct net_device *orig_dev)
 {
@@ -2182,6 +2202,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	__u32 ts_status;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
+	void *rx_shadow_ring_entry = NULL;
 
 	/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
 	 * We may add members to them until current aligned size without forcing
@@ -2277,7 +2298,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (!h.raw)
 		goto drop_n_account;
 	if (po->tp_version <= TPACKET_V2) {
+		/* Attempt to allocate shadow ring entry.
+		 * If already inuse then the ring is full.
+		 */
+		rx_shadow_ring_entry = packet_rx_shadow_aquire_head(po);
+		if (unlikely(!rx_shadow_ring_entry))
+			goto ring_is_full;
+
 		packet_increment_rx_head(po, &po->rx_ring);
+
 		/*
 		 * LOSING will be reported till you read the stats,
 		 * because it's COR - Clear On Read.
@@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 #endif
 
 	if (po->tp_version <= TPACKET_V2) {
+		spin_lock(&sk->sk_receive_queue.lock);
 		__packet_set_status(po, h.raw, status);
+		packet_rx_shadow_release(rx_shadow_ring_entry);
+		spin_unlock(&sk->sk_receive_queue.lock);
+
 		sk->sk_data_ready(sk);
 	} else {
 		prb_clear_blk_fill_status(&po->rx_ring);
@@ -4197,6 +4230,25 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 	goto out;
 }
 
+static struct packet_ring_shadow_entry *
+	packet_rx_shadow_alloc(unsigned int tp_frame_nr)
+{
+	struct packet_ring_shadow_entry *rx_shadow_ring;
+	int ring_size;
+	int i;
+
+	ring_size = tp_frame_nr * sizeof(*rx_shadow_ring);
+	rx_shadow_ring = kmalloc(ring_size, GFP_KERNEL);
+
+	if (!rx_shadow_ring)
+		return NULL;
+
+	for (i = 0; i < tp_frame_nr; i++)
+		rx_shadow_ring[i].inuse = 0;
+
+	return rx_shadow_ring;
+}
+
 static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
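The internal.h side of the patch is not quoted above. For reference, the shadow-ring acquire/release protocol can be modeled as a self-contained userspace sketch; the names mirror the patch, but this model is hypothetical and, unlike the patch (which advances the head separately via packet_increment_rx_head), folds the head increment into the acquire:

```c
#include <assert.h>
#include <stdlib.h>

/* Per-frame shadow state, mirroring the patch's entry->inuse flag. */
struct shadow_entry {
	unsigned int inuse;	/* claimed by a kernel thread, not yet released */
};

struct shadow_ring {
	struct shadow_entry *ring;
	unsigned int frame_nr;
	unsigned int head;
};

static struct shadow_ring *shadow_alloc(unsigned int frame_nr)
{
	struct shadow_ring *s = malloc(sizeof(*s));
	s->ring = calloc(frame_nr, sizeof(*s->ring));	/* all slots start free */
	s->frame_nr = frame_nr;
	s->head = 0;
	return s;
}

/* Claim the head slot, or fail if a slower thread still owns it
 * (the ring would otherwise wrap onto an in-progress frame). */
static struct shadow_entry *shadow_acquire_head(struct shadow_ring *s)
{
	struct shadow_entry *e = &s->ring[s->head];
	if (e->inuse)
		return NULL;	/* treated as ring full */
	e->inuse = 1;
	s->head = (s->head + 1) % s->frame_nr;
	return e;
}

static void shadow_release(struct shadow_entry *e)
{
	e->inuse = 0;
}
```

The key property is the failure case: once the ring wraps back onto a still-claimed slot, acquisition fails instead of handing two threads the same frame.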
RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
> >> > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >> > is that the documentation of the tp_status field is somewhat
> >> > inconsistent. In some places it's described as TP_STATUS_KERNEL(0)
> >> > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >> > meaning the entry is owned by user space. In other places ownership
> >> > by user space is defined by the TP_STATUS_USER(1) bit being set.
> >>
> >> But indeed this example in packet_mmap.txt is problematic
> >>
> >>     if (status == TP_STATUS_KERNEL)
> >>         retval = poll(&pfd, 1, timeout);
> >>
> >> It does not really matter whether the docs are possibly inconsistent and
> >> which one is authoritative. Examples like the above make it likely that
> >> some user code expects such code to work.
> >
> > Yes, that's exactly my concern. Yet another troubling example seems to be
> > libpcap, which also is looking specifically for status to be anything other
> > than TP_STATUS_KERNEL(0) to indicate a frame is available in user space.
>
> Good catch. If pcap-linux.c relies on this then the status field
> cannot be changed. Other fields can be modified freely while tp_status
> remains 0, perhaps that's an option.

Possibly. Someone else suggested something similar, but in at least the
one example we thought through it still seemed like it didn't address
the problem.

For example, let's say we used tp_len == -1 to indicate to other kernel
threads that the entry was already in progress. This would require that
user space never set tp_len = -1 before returning the entry back to the
kernel. If it did then no kernel thread would ever claim ownership and
the ring would hang.

Now, it seems pretty unlikely that user space would do such a thing, so
maybe we could look past that, but then we run into the issue that there
is still a window of opportunity for other kernel threads to come in and
wrap the ring.
The reason is we can't set tp_len to the correct length after setting
tp_status, because user space could grab the entry and see tp_len == -1,
so we have to set tp_len before we set tp_status. This means that there
is still a window where other kernel threads could come in, see tp_len
as something other than -1 and a tp_status of TP_STATUS_KERNEL, and
think it's OK to allocate the entry. This puts us back to where we are
today (arguably with a smaller window, but a window nonetheless).

Alternatively we could reacquire the spin_lock to then set tp_len
followed by tp_status. This would give the necessary indivisibility in
the kernel while preserving proper order as made visible to user space,
but it comes at the cost of another spin_lock.

Thanks for the suggestion. If you can think of a way around this I'm
all ears. I'll think on this some more, but so far I'm stuck on how to
get past having to broaden the scope of the spin_lock, reacquire the
spin_lock, or use some sort of atomic construct along with a parallel
shadow ring structure (still thinking through that one as well).
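The residual window described above can be made concrete with a small model. The field names follow the tpacket header; the interleaving is the hypothetical tp_len == -1 scheme under discussion, not real kernel code:

```c
#include <assert.h>

#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   (1 << 0)

/* Model of one ring frame under the tp_len sentinel scheme. */
struct slot {
	int tp_len;		/* -1 would mean "claimed, in progress" */
	unsigned int tp_status;
};

/* A slot "looks free" to a concurrent kernel thread when tp_status is
 * TP_STATUS_KERNEL and tp_len is not the -1 sentinel. */
static int looks_free(const struct slot *s)
{
	return s->tp_status == TP_STATUS_KERNEL && s->tp_len != -1;
}
```

Walking the interleaving through `looks_free()` shows the problem: tp_len must be written with the real length before tp_status flips, and between those two stores the slot transiently satisfies the "free" test again.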
RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
On Wednesday, April 04, 2018 9:49 AM, Willem de Bruijn <will...@google.com> wrote:
> On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen <jro...@cisco.com> wrote:
> > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> > causes the ring to get corrupted by allowing multiple kernel threads
> > to claim ownership of the same ring entry. Mark the ring entry as
> > already being used within the spin_lock to prevent other kernel
> > threads from reusing the same entry before it's fully filled in,
> > passed to user space, and then eventually passed back to the kernel
> > for use with a new packet.
> >
> > Note that the proposed change may modify the semantics of the
> > interface between kernel space and user space in a way which may cause
> > some applications to no longer work properly.
>
> As long as TP_STATUS_USER (1) is not set, userspace should ignore
> the slot..
>
> > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> > is that the documentation of the tp_status field is somewhat
> > inconsistent. In some places it's described as TP_STATUS_KERNEL(0)
> > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> > meaning the entry is owned by user space. In other places ownership
> > by user space is defined by the TP_STATUS_USER(1) bit being set.
>
> But indeed this example in packet_mmap.txt is problematic
>
>     if (status == TP_STATUS_KERNEL)
>         retval = poll(&pfd, 1, timeout);
>
> It does not really matter whether the docs are possibly inconsistent and
> which one is authoritative. Examples like the above make it likely that
> some user code expects such code to work.

Yes, that's exactly my concern. Yet another troubling example seems to
be libpcap, which also is looking specifically for status to be anything
other than TP_STATUS_KERNEL(0) to indicate a frame is available in user
space. Either way things are broken.
They are broken as they stand now because the ring can get overrun and
the kernel and user space tracking of the ring can get out of sync. And
they are broken with the below change because some user space
applications will be looking for anything other than TP_STATUS_KERNEL,
so again the ring will get out of sync. The difference here being that
the way it is today is on average (across all environments and across
all user space apps) less likely to occur, while with the change below
it is much more likely to occur.

Maybe the right answer here is to implement a fix that is compatible
with existing applications and accept any potential performance impacts,
and then add yet another version (TPACKET_V4?) which more strictly
requires the TP_STATUS_USER bit for passing ownership.

> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> >                 if (po->stats.stats1.tp_drops)
> >                         status |= TP_STATUS_LOSING;
> >         }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
>
> No need to reinterpret existing flags. tp_status is a u32 with
> sufficient undefined bits.

Agreed.

> > +       if (po->tp_version <= TPACKET_V2)
> > +               __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> >         po->stats.stats1.tp_packets++;
> >         if (copy_skb) {
> >                 status |= TP_STATUS_COPY;
> > --
> > 2.10.3.dirty

Thanks for the feedback! Jon.
RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e0f3f4a..264d7b2 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> >                 if (po->stats.stats1.tp_drops)
> >                         status |= TP_STATUS_LOSING;
> >         }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> > +       if (po->tp_version <= TPACKET_V2)
> > +               __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> >         po->stats.stats1.tp_packets++;
> >         if (copy_skb) {
> >                 status |= TP_STATUS_COPY;
>
> This patch looks correct. Please resend it with proper signed-off-by
> and with a kernel code indenting style (tabs). Is this bug present
> since the beginning of af_packet and multiqueue devices or did it get
> introduced in some previous kernel?

Sorry about the tabs, I'll fix that and try to figure out what I did
wrong with the signed-off-by. I've looked back as far as I could find
online (2.6.11) and it would appear that this bug has always been there.

Thanks, jon.
[RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Mark the ring entry as
already being used within the spin_lock to prevent other kernel
threads from reusing the same entry before it's fully filled in,
passed to user space, and then eventually passed back to the kernel
for use with a new packet.

Note that the proposed change may modify the semantics of the
interface between kernel space and user space in a way which may
cause some applications to no longer work properly. More discussion
on this change can be found in the additional comments section titled
"3. Discussion on packet_mmap ownership semantics:".

Signed-off-by: Jon Rosen <jro...@cisco.com>
---

--- Additional Comments Section ---

1. Description of the diffs:

   TPACKET_V1 and TPACKET_V2 format rings:
   ---
   Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
   prevent other kernel threads from re-using the same entry.

   This is necessary because there may be a delay from the time the
   spin_lock is released to the time that the packet is completed and
   the corresponding ring entry is marked as owned by user space. If
   during this time other kernel threads enqueue more packets to the
   ring than the size of the ring then it will cause multiple kernel
   threads to operate on the same entry at the same time, corrupting
   packets and the ring state.

   By marking the entry as allocated (IN_PROGRESS) we prevent other
   kernel threads from incorrectly re-using an entry that is still in
   the progress of being filled in before it is passed to user space.

   This forces each entry through the following states:

   +-> 1. (tp_status == TP_STATUS_KERNEL)
   |      Free: For use by any kernel thread to store a new packet
   |
   |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
   |      Allocated: In use by a *specific* kernel thread
   |
   |   3. (tp_status & TP_STATUS_USER)
   |      Available: Packet available for user space to process
   |
   +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL

   No impact on TPACKET_V3 format rings:
   ---
   Packet entry ownership is already protected from other kernel
   threads potentially re-using the same entry. This is done inside
   packet_current_rx_frame() where storage is allocated for the
   current packet. Since this is done within the spin_lock no
   additional status updates for this entry are required.

   Defining TP_STATUS_IN_PROGRESS:
   ---
   Rather than defining a new bit we re-use an existing bit for this
   intermediate state. Status will eventually be overwritten with the
   actual true status when passed to user space. Any bit used to pass
   information to user space other than the one that passes ownership
   is suitable (can't use TP_STATUS_USER). Alternatively a new bit
   could be defined.

2. More detailed discussion:

   Ring entries basically have 2 states, owned by the kernel or owned
   by user space. For single producer/single consumer this works fine.
   For multiple producers there is a window between the call to
   spin_unlock [F] and the call to __packet_set_status [J] where, if
   there are enough packets added to the ring by other kernel threads,
   the ring can wrap and multiple threads will end up using the same
   ring entries.

   This occurs because the ring entry allocated at [C] did not modify
   the state of the entry, so it continues to appear as owned by the
   kernel and available for use for new packets even though it has
   already been allocated.

   A simple fix is to temporarily mark the ring entries within the
   spin lock such that user space will still think it's owned by the
   kernel and other kernel threads will not see it as available to be
   used for new packets. If a kernel thread gets delayed between [F]
   and [J] for an extended period of time and the ring wraps back to
   the same point, then subsequent kernel threads' attempts to
   allocate will fail and be treated as the ring being full.
The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
to prevent other kernel threads from re-using the same entry. Note
that any existing bit other than TP_STATUS_USER could have been used.

af_packet.c:tpacket_rcv()

	... code removed for brevity ...

	// Acquire spin lock
A:	spin_lock(&sk->sk_receive_queue.lock);
	// Preemption is disabled

	// Get current ring entry
B:	h.raw = packet_current_rx_frame(po, skb,
			TP_STATUS_KERNEL, (macoff + snaplen));

	// Get out if ring is full
	// Code not shown, but it will also release the lock
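The overrun and the effect of marking inside the lock can be modeled
with a tiny single-threaded simulation. This is a sketch under stated
assumptions, not kernel code: ring size, helper name, and the
in-progress bit are all invented for illustration.

```c
/* Simulation of the ring overrun described above: without marking an
 * entry inside the lock, a producer that wraps the ring can claim a
 * slot that another producer is still filling in. */
#define RING_SIZE 4
#define TP_STATUS_KERNEL      0UL
#define TP_STATUS_IN_PROGRESS (1UL << 1) /* assumed bit, see above */

static unsigned long ring[RING_SIZE];    /* tp_status of each entry */

/* Returns the claimed slot index, or -1 if the entry is not free
 * (treated as ring-full). 'mark' selects whether the fix is applied,
 * i.e. whether the entry is marked while the "lock" is still held. */
static int try_claim(int head, int mark)
{
	int idx = head % RING_SIZE;

	if (ring[idx] != TP_STATUS_KERNEL)
		return -1;                     /* entry in use: ring full */
	if (mark)
		ring[idx] = TP_STATUS_IN_PROGRESS;
	return idx;
}
```

Without the fix (mark == 0), claiming slot 0 and then wrapping back to
head == RING_SIZE "succeeds" a second time on the very same slot, so
two producers own one entry. With mark == 1 the second claim returns
-1 and is treated as the ring being full, which matches the behavior
the patch description calls for.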
[RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun