On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote:
> I'm still in flux/undecided about how long we should delay the first
> touching of pkt-data, which happens when calling eth_type_trans().
> Should it stay in the driver or not(?).
>
> In the extreme case, optimizing for RPS sending to remote CPUs, delay
> calling eth_type_trans() as long as possible:
>
> 1. In the driver, only start prefetching data into the L2/L3 cache
> 2. Stack calls get_rps_cpu() and assumes skb_get_hash() has the HW hash
> 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
> 4. On the remote CPU, in process_backlog, call eth_type_trans() on
>    sd->input_pkt_queue
>
> On the other hand, if the HW desc can provide skb->proto, and we can
> lazily evaluate skb->pkt_type, then it is okay to keep that
> responsibility in the driver (as the call to eth_type_trans()
> basically disappears).
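If I read steps 1-4 right, the driver side would look roughly like the
sketch below. This is only a sketch: the mydrv_* ring/descriptor helpers
are made up, and today netif_receive_skb() still expects skb->protocol
to have been set by eth_type_trans(), so this only makes sense together
with the proposed deferral to process_backlog() on the remote CPU.

#include <linux/etherdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

static int mydrv_rx_poll(struct mydrv_ring *ring, int budget)
{
	int done = 0;

	while (done < budget) {
		struct mydrv_desc *desc = mydrv_next_desc(ring);
		struct sk_buff *skb;

		if (!desc)
			break;

		/* Step 1: only warm the cache, do not read pkt-data yet. */
		prefetch(desc->data);

		skb = mydrv_build_skb(ring, desc);
		if (unlikely(!skb))
			break;

		/* Step 2: expose the HW RSS hash so skb_get_hash(), and
		 * therefore get_rps_cpu(), never dissects the packet here.
		 */
		skb_set_hash(skb, le32_to_cpu(desc->rss_hash),
			     PKT_HASH_TYPE_L4);

		/* Steps 3-4: no eth_type_trans() on this CPU; RPS enqueues
		 * the skb on remote_cpu->sd->input_pkt_queue, and the remote
		 * CPU would run eth_type_trans() from process_backlog().
		 */
		netif_receive_skb(skb);
		done++;
	}

	return done;
}
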
Delaying means GRO won't be able to recycle its super-hot skb (see
napi_get_frags()).

You might optimize the reception of packets in the router case (poor
GRO aggregation rate), but you'll slow down GRO efficiency when
receiving nice GRO trains: when we receive a train of 10 MSS, the
driver keeps reusing the same sk_buff, which stays very hot in its L1.
(This was the original idea of build_skb(): get nice cache locality for
the metadata, since it is 4 cache lines per sk_buff.)

Most drivers still have no clue why it is important to allocate the skb
_after_ receiving the ethernet frame and not in advance. (The lazy
drivers allocate ~1024 skbs to prefill their ~1024-slot RX ring.)
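For reference, the build_skb() pattern looks roughly like this; a
minimal sketch, with made-up mydrv_* names, assuming the RX ring is
filled with bare data buffers rather than prefilled skbs:

#include <linux/etherdevice.h>
#include <linux/skbuff.h>

static void mydrv_rx_frame(struct mydrv_ring *ring, void *data,
			   unsigned int len, unsigned int truesize)
{
	struct sk_buff *skb;

	/* Allocate the skb metadata only now, after the ethernet frame
	 * has landed, instead of one skb per descriptor at ring setup:
	 * its ~4 cache lines are then hot on the CPU about to use them.
	 */
	skb = build_skb(data, truesize);
	if (unlikely(!skb)) {
		mydrv_recycle_buffer(ring, data);	/* made-up helper */
		return;
	}

	skb_reserve(skb, NET_SKB_PAD);	/* headroom left by the driver */
	skb_put(skb, len);

	skb->protocol = eth_type_trans(skb, ring->netdev);
	napi_gro_receive(&ring->napi, skb);
}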