This patch series implements the cleanups suggested by Peter and Andy, removes lazy TLB mm refcounting on x86, and shows how other architectures could implement that same optimization.

The previous patch series already seems to have removed most of the cache line contention I was seeing at context switch time, so CPU use of the memcache and memcache-like workloads has not changed measurably with this patch series.

However, the memory bandwidth used by the memcache system has been reduced by about 1% while serving the same number of queries per second. These results are from two-socket Haswell and Broadwell systems.

On larger systems (4 or 8 sockets), one might also see a measurable drop in CPU time used, with workloads where the previous patch series does not remove all cache line contention on the mm.

This is against the latest -tip tree, and seems to be stable (on top of another tree) with workloads that do over a million context switches a second.

V2 of the series uses Peter's logic flow in context_switch(), and is a step in the direction of completely getting rid of ->active_mm.