Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Thu, Sep 29, 2016 at 11:15 AM, Eric Dumazet wrote: > On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote: >> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote: >> >> > It addresses the issue that Rick Jones pointed out was happening with >> > XPS. When packets are sent for a flow that has no socket and XPS is >> > enabled then each packet uses the XPS queue based on the running CPU. >> > Since the thread sending on a flow can be rescheduled on different >> > CPUs this is creating ooo packets. In this case the ooo is being >> > caused by interaction with XPS. >> > >> >> Nope, your patch does not address the problem properly. >> >> I am not sure I want to spend more time explaining the issue. >> >> Lets talk about this in Tokyo next week. >> > > Just as a reminder, sorry to bother you, stating some obvious facts for > both of us. We have public exchanges, so we also need to re-explain how > things work. > > Queue selection on xmit happens before we hit the qdisc and its delays. > > So when you access txq->dql.num_completed_ops and > txq->dql.num_enqueue_ops you can observe values that do not change for a > while. > > Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow > (skb_get_hash() returns the same value for these 2 packets) > > P1 is sent on behalf of CPU 1, we pickup queue txq1, and queue the > packet on its qdisc . Transmit does not happen because of some > constraints like rate limiting or scheduling constraints. > > P2 is sent on behalf of CPU 2, we pickup queue txq2, notice that prior > packet chose txq1. We check txq1->dql and decide it is fine to use txq2, > since the dql params of txq1 were not changed yet. > > ( txq->dql.num_completed_ops == ent.queue_ptr ) > > Note that in RFS case, we have the guarantee that we observe 'live > queues' since they are the per cpu backlog. > > So input_queue_head_incr() and input_queue_tail_incr_save() are > correctly doing the OOO prevention, because a queued packet immediately > changes the state. > > So really your patch works if you have no qdisc, or a non congested > qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast : We really > want to avoid steering P2, P3, ..., PN on this full pfifo while maybe > other txq are idle). Strange attractors are back (check commit > 9b462d02d6dd6 ) > Understood. > You could avoid (ab)using BQL with a different method, grabbing > skb->destructor for the packets that are socketless > > The hash table would simply track the sum of skb->truesize to allow flow > migration. This would be self contained and not intrusive. > Okay, will look that. > > >
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On 09/29/2016 06:18 AM, Eric Dumazet wrote: Well, then what this patch series is solving ? You have a producer of packets running on 8 vcpus in a VM. Packets are exiting the VM and need to be queued on a mq NIC in the hypervisor. Flow X can be scheduled on any of these 8 vcpus, so XPS is currently selecting different TXQ. Just for completeness, in my testing, the VMs were single-vCPU. rick jones
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote: > On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote: > > > It addresses the issue that Rick Jones pointed out was happening with > > XPS. When packets are sent for a flow that has no socket and XPS is > > enabled then each packet uses the XPS queue based on the running CPU. > > Since the thread sending on a flow can be rescheduled on different > > CPUs this is creating ooo packets. In this case the ooo is being > > caused by interaction with XPS. > > > > Nope, your patch does not address the problem properly. > > I am not sure I want to spend more time explaining the issue. > > Lets talk about this in Tokyo next week. > Just as a reminder, sorry to bother you, stating some obvious facts for both of us. We have public exchanges, so we also need to re-explain how things work. Queue selection on xmit happens before we hit the qdisc and its delays. So when you access txq->dql.num_completed_ops and txq->dql.num_enqueue_ops you can observe values that do not change for a while. Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow (skb_get_hash() returns the same value for these 2 packets) P1 is sent on behalf of CPU 1, we pickup queue txq1, and queue the packet on its qdisc . Transmit does not happen because of some constraints like rate limiting or scheduling constraints. P2 is sent on behalf of CPU 2, we pickup queue txq2, notice that prior packet chose txq1. We check txq1->dql and decide it is fine to use txq2, since the dql params of txq1 were not changed yet. ( txq->dql.num_completed_ops == ent.queue_ptr ) Note that in RFS case, we have the guarantee that we observe 'live queues' since they are the per cpu backlog. So input_queue_head_incr() and input_queue_tail_incr_save() are correctly doing the OOO prevention, because a queued packet immediately changes the state. So really your patch works if you have no qdisc, or a non congested qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast : We really want to avoid steering P2, P3, ..., PN on this full pfifo while maybe other txq are idle). Strange attractors are back (check commit 9b462d02d6dd6 ) You could avoid (ab)using BQL with a different method, grabbing skb->destructor for the packets that are socketless The hash table would simply track the sum of skb->truesize to allow flow migration. This would be self contained and not intrusive.
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote: > It addresses the issue that Rick Jones pointed out was happening with > XPS. When packets are sent for a flow that has no socket and XPS is > enabled then each packet uses the XPS queue based on the running CPU. > Since the thread sending on a flow can be rescheduled on different > CPUs this is creating ooo packets. In this case the ooo is being > caused by interaction with XPS. > Nope, your patch does not address the problem properly. I am not sure I want to spend more time explaining the issue. Lets talk about this in Tokyo next week.
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Thu, Sep 29, 2016 at 9:18 AM, Eric Dumazet wrote: > On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote: >> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet >> wrote: >> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote: >> >> xps_flows maintains a per device flow table that is indexed by the >> >> skbuff hash. The table is only consulted when there is no queue saved in >> >> a transmit socket for an skbuff. >> >> >> >> Each entry in the flow table contains a queue index and a queue >> >> pointer. The queue pointer is set when a queue is chosen using a >> >> flow table entry. This pointer is set to the head pointer in the >> >> transmit queue (which is maintained by BQL). >> >> >> >> The new function get_xps_flows_index that looks up flows in the >> >> xps_flows table. The entry returned gives the last queue a matching flow >> >> used. The returned queue is compared against the normal XPS queue. If >> >> they are different, then we only switch if the tail pointer in the TX >> >> queue has advanced past the pointer saved in the entry. In this >> >> way OOO should be avoided when XPS wants to use a different queue. >> > >> > There is something I do not understand with this. >> > >> > If this OOO avoidance is tied to BQL, it means that packets sitting in a >> > qdisc wont be part of the detection. >> > >> > So packets of flow X could possibly be queued on multiple qdiscs. >> > >> Hi Eric, >> >> I'm not sure I understand your concern. If packets for flow X can be >> queued on multiple qdiscs wouldn't that mean that flow will be subject >> to ooo transmission regardless of this patch? In other words here >> we're trying to keep packets for the flow in order as they are emitted >> from the qdiscs which we assume maintains ordering of packets in a >> flow. > > Well, then what this patch series is solving ? > It addresses the issue that Rick Jones pointed out was happening with XPS. When packets are sent for a flow that has no socket and XPS is enabled then each packet uses the XPS queue based on the running CPU. Since the thread sending on a flow can be rescheduled on different CPUs this is creating ooo packets. In this case the ooo is being caused by interaction with XPS. > You have a producer of packets running on 8 vcpus in a VM. > > Packets are exiting the VM and need to be queued on a mq NIC in the > hypervisor. > > Flow X can be scheduled on any of these 8 vcpus, so XPS is currently > selecting different TXQ. > > Your patch aims to detect this and steer packets to one TXQ, but not > using a static hash() % num_real_queues (as RSS would have done on a NIC > at RX), but trying to please XPS (as : selecting a queue close to the > CPU producing the packet. VM with one vcpu, and cpu bounded, would > really be happy to not spread packets all over the queues) > You're not describing this patch, you're describing how XPS works in the first place. This patch is done to make socketless packets work well with XPS. If all you want is a static hash() % num_real_queues then just disable XPS to get that. > Your mechanism relies on counters updated at the time packets are given > to the NIC, not at the time packets are given to the txq (and eventually > sit for a while in the qdisc right before BQL layer) > > dev_queue_xmit() is not talking to the device directly... > > > All I am saying is that the producer/consumer counters you added are not > at the right place. > > You tried hard to not change the drivers, by adding something to > existing BQL. But packets can sit in a qdisc for a while... > But again, this patch is to prevent ooo being caused by an interaction with XPS. It does not address ooo being caused by qdiscs or VMs, but then I don't see that as being the issue reported by Rick. > So you added 2 potential cache lines misses per outgoing packet, but > with no guarantee that OOO are really avoided. > > >
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote: > On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet wrote: > > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote: > >> xps_flows maintains a per device flow table that is indexed by the > >> skbuff hash. The table is only consulted when there is no queue saved in > >> a transmit socket for an skbuff. > >> > >> Each entry in the flow table contains a queue index and a queue > >> pointer. The queue pointer is set when a queue is chosen using a > >> flow table entry. This pointer is set to the head pointer in the > >> transmit queue (which is maintained by BQL). > >> > >> The new function get_xps_flows_index that looks up flows in the > >> xps_flows table. The entry returned gives the last queue a matching flow > >> used. The returned queue is compared against the normal XPS queue. If > >> they are different, then we only switch if the tail pointer in the TX > >> queue has advanced past the pointer saved in the entry. In this > >> way OOO should be avoided when XPS wants to use a different queue. > > > > There is something I do not understand with this. > > > > If this OOO avoidance is tied to BQL, it means that packets sitting in a > > qdisc wont be part of the detection. > > > > So packets of flow X could possibly be queued on multiple qdiscs. > > > Hi Eric, > > I'm not sure I understand your concern. If packets for flow X can be > queued on multiple qdiscs wouldn't that mean that flow will be subject > to ooo transmission regardless of this patch? In other words here > we're trying to keep packets for the flow in order as they are emitted > from the qdiscs which we assume maintains ordering of packets in a > flow. Well, then what this patch series is solving ? You have a producer of packets running on 8 vcpus in a VM. Packets are exiting the VM and need to be queued on a mq NIC in the hypervisor. Flow X can be scheduled on any of these 8 vcpus, so XPS is currently selecting different TXQ. Your patch aims to detect this and steer packets to one TXQ, but not using a static hash() % num_real_queues (as RSS would have done on a NIC at RX), but trying to please XPS (as : selecting a queue close to the CPU producing the packet. VM with one vcpu, and cpu bounded, would really be happy to not spread packets all over the queues) Your mechanism relies on counters updated at the time packets are given to the NIC, not at the time packets are given to the txq (and eventually sit for a while in the qdisc right before BQL layer) dev_queue_xmit() is not talking to the device directly... All I am saying is that the producer/consumer counters you added are not at the right place. You tried hard to not change the drivers, by adding something to existing BQL. But packets can sit in a qdisc for a while... So you added 2 potential cache lines misses per outgoing packet, but with no guarantee that OOO are really avoided.
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet wrote: > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote: >> xps_flows maintains a per device flow table that is indexed by the >> skbuff hash. The table is only consulted when there is no queue saved in >> a transmit socket for an skbuff. >> >> Each entry in the flow table contains a queue index and a queue >> pointer. The queue pointer is set when a queue is chosen using a >> flow table entry. This pointer is set to the head pointer in the >> transmit queue (which is maintained by BQL). >> >> The new function get_xps_flows_index that looks up flows in the >> xps_flows table. The entry returned gives the last queue a matching flow >> used. The returned queue is compared against the normal XPS queue. If >> they are different, then we only switch if the tail pointer in the TX >> queue has advanced past the pointer saved in the entry. In this >> way OOO should be avoided when XPS wants to use a different queue. > > There is something I do not understand with this. > > If this OOO avoidance is tied to BQL, it means that packets sitting in a > qdisc wont be part of the detection. > > So packets of flow X could possibly be queued on multiple qdiscs. > Hi Eric, I'm not sure I understand your concern. If packets for flow X can be queued on multiple qdiscs wouldn't that mean that flow will be subject to ooo transmission regardless of this patch? In other words here we're trying to keep packets for the flow in order as they are emitted from the qdiscs which we assume maintains ordering of packets in a flow. Tom > >
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote: > xps_flows maintains a per device flow table that is indexed by the > skbuff hash. The table is only consulted when there is no queue saved in > a transmit socket for an skbuff. > > Each entry in the flow table contains a queue index and a queue > pointer. The queue pointer is set when a queue is chosen using a > flow table entry. This pointer is set to the head pointer in the > transmit queue (which is maintained by BQL). > > The new function get_xps_flows_index that looks up flows in the > xps_flows table. The entry returned gives the last queue a matching flow > used. The returned queue is compared against the normal XPS queue. If > they are different, then we only switch if the tail pointer in the TX > queue has advanced past the pointer saved in the entry. In this > way OOO should be avoided when XPS wants to use a different queue. There is something I do not understand with this. If this OOO avoidance is tied to BQL, it means that packets sitting in a qdisc wont be part of the detection. So packets of flow X could possibly be queued on multiple qdiscs.
[PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
xps_flows maintains a per device flow table that is indexed by the skbuff hash. The table is only consulted when there is no queue saved in a transmit socket for an skbuff. Each entry in the flow table contains a queue index and a queue pointer. The queue pointer is set when a queue is chosen using a flow table entry. This pointer is set to the head pointer in the transmit queue (which is maintained by BQL). The new function get_xps_flows_index that looks up flows in the xps_flows table. The entry returned gives the last queue a matching flow used. The returned queue is compared against the normal XPS queue. If they are different, then we only switch if the tail pointer in the TX queue has advanced past the pointer saved in the entry. In this way OOO should be avoided when XPS wants to use a different queue. Signed-off-by: Tom Herbert --- net/Kconfig| 6 net/core/dev.c | 87 +- 2 files changed, 74 insertions(+), 19 deletions(-) diff --git a/net/Kconfig b/net/Kconfig index 7b6cd34..f77fad1 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -255,6 +255,12 @@ config XPS depends on SMP default y +config XPS_FLOWS + bool + depends on XPS + depends on BQL + default y + config HWBM bool diff --git a/net/core/dev.c b/net/core/dev.c index c0c291f..1ca08b9 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) } #endif /* CONFIG_NET_EGRESS */ +/* Must be called with RCU read_lock */ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) { #ifdef CONFIG_XPS @@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) struct xps_map *map; int queue_index = -1; - rcu_read_lock(); dev_maps = rcu_dereference(dev->xps_maps); if (dev_maps) { map = rcu_dereference( @@ -3228,15 +3228,62 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) else queue_index = map->queues[reciprocal_scale(skb_get_hash(skb), map->len)]; - if (unlikely(queue_index >= dev->real_num_tx_queues)) - queue_index = -1; + if (queue_index >= 0 && + likely(queue_index < dev->real_num_tx_queues)) + return queue_index; } } - rcu_read_unlock(); +#endif + return skb_tx_hash(dev, skb); +} + +/* Must be called with RCU read_lock */ +static int get_xps_flows_index(struct net_device *dev, struct sk_buff *skb) +{ +#ifdef CONFIG_XPS_FLOWS + struct xps_dev_flow_table *flow_table; + struct xps_dev_flow ent; + int queue_index; + struct netdev_queue *txq; + u32 hash; + + queue_index = get_xps_queue(dev, skb); + if (queue_index < 0) + return -1; + + flow_table = rcu_dereference(dev->xps_flow_table); + if (!flow_table) + return queue_index; + + hash = skb_get_hash(skb); + if (!hash) + return queue_index; + + ent.v64 = flow_table->flows[hash & flow_table->mask].v64; + + if (queue_index != ent.queue_index && + ent.queue_index >= 0 && + ent.queue_index < dev->real_num_tx_queues) { + txq = netdev_get_tx_queue(dev, ent.queue_index); + if ((int)(txq->dql.num_completed_ops - ent.queue_ptr) < 0) { + /* The current queue's tail has not advanced beyond the +* last packet that was enqueued using the table entry. +* We can't change queues without risking OOO. Stick +* with the queue listed in the flow table. +*/ + queue_index = ent.queue_index; + } + } + + /* Save the updated entry */ + txq = netdev_get_tx_queue(dev, queue_index); + ent.queue_index = queue_index; + ent.queue_ptr = txq->dql.num_enqueue_ops; + flow_table->flows[hash & flow_table->mask].v64 = ent.v64; return queue_index; #else - return -1; + return get_xps_queue(dev, skb); #endif } @@ -3244,22 +3291,24 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb) { struct sock *sk = skb->sk; int queue_index = sk_tx_queue_get(sk); - - if (queue_index < 0 || skb->ooo_okay || - queue_index >= dev->real_num_tx_queues) { - int new_index = get_xps_queue(dev, skb); - if (new_index < 0) - new_index = skb_tx_hash(dev, skb); - - if (queue_index != new_index && sk && - sk_full