Found a really simple solution that almost restores the original performance: just add a prefetch on alloc. For some reason, I assumed that this was already done since the troublesome commit I investigated mentioned something about prefetching... I guess the commit referred to the hardware prefetcher in the CPU.
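
For anyone who wants to see the idea in isolation, here is a minimal sketch of "prefetch on alloc". It is not the actual patch: `toy_buf`, `toy_pool` and the `toy_pool_*` functions are hypothetical stand-ins for the mbuf and mempool, and it uses the GCC/Clang builtin `__builtin_prefetch` rather than DPDK's own `rte_prefetch0()` wrapper so that it compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POOL_SIZE 64

/* Hypothetical stand-in for a packet buffer; NOT the real struct rte_mbuf. */
struct toy_buf {
    uint16_t data_len;
    uint32_t pkt_len;
    uint8_t  payload[2048];
};

/* Hypothetical LIFO pool of free buffers; NOT the real rte_mempool. */
struct toy_pool {
    struct toy_buf *free_list[TOY_POOL_SIZE];
    size_t          free_count;
    struct toy_buf  storage[TOY_POOL_SIZE];
};

/* Put every buffer of the backing storage on the free list. */
static void toy_pool_init(struct toy_pool *p)
{
    p->free_count = 0;
    for (size_t i = 0; i < TOY_POOL_SIZE; i++)
        p->free_list[p->free_count++] = &p->storage[i];
}

/* Pop a buffer and prefetch its first cache line before returning it. */
static struct toy_buf *toy_pool_alloc(struct toy_pool *p)
{
    if (p->free_count == 0)
        return NULL;

    struct toy_buf *b = p->free_list[--p->free_count];

    /*
     * The caller is about to write the header fields, so ask the CPU to
     * start loading that cache line now. Arguments: address, 1 = expect a
     * write, 3 = keep in all cache levels. This is only a hint.
     */
    __builtin_prefetch(b, 1, 3);
    return b;
}

/* Return a buffer to the pool. */
static void toy_pool_free(struct toy_pool *p, struct toy_buf *b)
{
    assert(p->free_count < TOY_POOL_SIZE);
    p->free_list[p->free_count++] = b;
}
```

The prefetch only pays off if there is some work between the alloc and the first write to the buffer, so the memory load can overlap with it. Whether it helps at all depends on the workload and the CPU, so it needs to be benchmarked.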
Adding an explicit prefetch instruction to the mbuf alloc function gives a throughput of 12.7 Mpps with the simple tx path and 10.35 Mpps with the full-featured tx path in my benchmark. DPDK 1.7.1 was at 14.1 and 10.7 Mpps, respectively. I guess I can live with that, since I'm primarily interested in the full-featured path and the drop from 10.7 to ~10.4 Mpps was due to another change.

Patch: https://github.com/dpdk-org/dpdk/pull/2

I also sent an email to the mailing list. I think the rx path could benefit from prefetching somewhere as well.

Paul