With large buffers -- like 4 K -- putting the router between two iperf instances on the 40 Gbit/sec link really does improve on the iperf-only test.
I confirmed my suspicion/memory about the reason: in the iperf-only test, the receiver runs at 99.7% to 100.0% CPU and frequently hits 100%. It is apparently single-threaded, so it gets no benefit from multiple cores. However, with the router in between, iperf receiver CPU drops to 85% or so -- and the overall throughput (as measured at the receiver side) actually improves noticeably.
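
For anyone who wants to reproduce the CPU numbers, here is a rough sketch of one way to sample a process's CPU usage on Linux by reading /proc/<pid>/stat. This is not how the figures above were taken; the function names are made up for illustration, and it assumes a Linux /proc filesystem. Note that /proc/<pid>/stat sums all threads in the process, so a single-threaded receiver tops out at about 100% here.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line.

    The comm field (2nd) is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' before tokenizing the rest.
    After that split, index 0 is 'state' (field 3), so utime (field 14)
    is index 11 and stime (field 15) is index 12.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                interval_s: float, clk_tck: int) -> float:
    """Convert a tick delta over interval_s seconds to percent of one core."""
    return 100.0 * (ticks_after - ticks_before) / (clk_tck * interval_s)


def sample_cpu_percent(pid: int, interval_s: float = 1.0) -> float:
    """Sample CPU usage of `pid` over interval_s seconds (Linux only)."""
    clk_tck = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, clk_tck)
```

Usage would be something like sample_cpu_percent(pid_of_iperf_receiver) in a loop while the test runs. To look at individual threads instead of the whole process, the same parsing should work on /proc/<pid>/task/<tid>/stat.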
