On Wed, 13 Sep 2017, Tim Chen wrote:

> Here's what the customer think happened and is willing to tell us.
> They have a parent process that spawns off 10 children per core and
> kicked them to run. The child processes all access a common library.
> We have 384 cores so 3840 child processes running. When migration occur on
> a page in the common library, the first child that access the page will
> page fault and lock the page, with the other children also page faulting
> quickly and pile up in the page wait list, till the first child is done.

I think we need some way to avoid migration in cases like this. This is crazy. Page migration was not written to deal with something like this.
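
To put numbers on the pile-up described above, here is a rough toy model in Python. It is only a sketch: the `Page` class and `simulate` function are hypothetical names made up for illustration, not kernel code, and the model assumes every child faults on the same page before the first one unlocks it. The 384 cores and 10 children per core come from the report.

```python
from collections import deque

CORES = 384              # from the report
CHILDREN_PER_CORE = 10   # from the report


class Page:
    """Toy stand-in for a page with a lock and a wait list (hypothetical model)."""

    def __init__(self):
        self.locked_by = None
        self.waiters = deque()

    def fault(self, pid):
        """The first faulter takes the lock; everyone else queues behind it."""
        if self.locked_by is None:
            self.locked_by = pid
            return True
        self.waiters.append(pid)
        return False

    def unlock(self):
        """Release the lock and wake every queued waiter; return how many were woken."""
        self.locked_by = None
        woken = len(self.waiters)
        self.waiters.clear()
        return woken


def simulate(cores=CORES, children_per_core=CHILDREN_PER_CORE):
    """Assume all children fault on one page before the lock holder finishes."""
    page = Page()
    children = cores * children_per_core
    for pid in range(children):
        page.fault(pid)
    queued = len(page.waiters)   # length of the wait list at its worst
    woken = page.unlock()        # entries the single unlock has to walk
    return children, queued, woken
```

With the reported numbers, `simulate()` returns `(3840, 3839, 3839)`: one child holds the page lock and the other 3839 sit on the wait list until that single unlock.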

