On Wed, 2012-11-14 at 18:28 +0000, Mel Gorman wrote:
> On Wed, Nov 14, 2012 at 11:24:42AM -0600, Andrew Theurer wrote:
> > > From: Peter Zijlstra <[email protected]>
> > > 
> > > Note: The scan period is much larger than it was in the original patch.
> > > The reason was because the system CPU usage went through the roof
> > > with a sample period of 500ms but it was unsuitable to have a
> > > situation where a large process could stall for excessively long
> > > updating pte_numa. This may need to be tuned again if a placement
> > > policy converges too slowly.
> > > 
> > > Previously, to probe the working set of a task, we'd use
> > > a very simple and crude method: mark all of its address
> > > space PROT_NONE.
> > > 
> > > That method has various (obvious) disadvantages:
> > > 
> > >  - it samples the working set at dissimilar rates,
> > >    giving some tasks a sampling quality advantage
> > >    over others.
> > > 
> > >  - creates performance problems for tasks with very
> > >    large working sets
> > > 
> > >  - over-samples processes with large address spaces but
> > >    which only very rarely execute
> > > 
> > > Improve that method by keeping a rotating offset into the
> > > address space that marks the current position of the scan,
> > > and advance it by a constant rate (in a CPU cycles execution
> > > proportional manner). If the offset reaches the last mapped
> > > address of the mm then it starts over at the first address.
> > 
> > I believe we will have problems with this. For example, running a large
> > KVM VM with 512GB memory, using the new defaults in this patch, and
> > assuming we never go longer per scan than the scan_period_min, it would
> > take over an hour to scan the entire VM just once. The defaults could
> > be changed, but ideally there should be no knobs like this in the final
> > version, as it should just work well under all conditions.
> 
> Good point. I'll switch to the old defaults. The system CPU usage will
> be high but that has to be coped with anyway. Ideally the tunables would
> go away but for now they are handy for debugging.
> 
> > Also, if such a method is kept, would it be possible to base it on fixed
> > number of pages successfully marked instead of a MB range?
> 
> I see a patch for that in the -tip tree. I'm still debating this with
> myself. On the one hand, it'll update the PTEs faster. On the other
> hand, the time spent scanning is now variable because it depends on the
> number of PTE updates. It's no longer a constant in terms of scanning
> although it would still be constant in terms of PTEs update. Hmm..
> 
> > Reason I bring it up is that we often can have VMs which are large in their
> > memory definition, but might not actually have a lot of pages faulted
> > in. We could be "scanning" sections of vma which are not even actually
> > present yet.
> 
> Ok, thanks for that. That would push me towards accepting it and being
> ok with the variable amount of scanning.
> 
> > > The per-task nature of the working set sampling functionality in this tree
> > > allows such constant rate, per task, execution-weight proportional
> > > sampling of the working set, with an adaptive sampling interval/frequency that
> > > goes from once per 2 seconds up to just once per 32 seconds. The current
> > > sampling volume is 256 MB per interval.
> > 
> > Once a new section is marked, is the previous section automatically
> > reverted?
> 
> No.
> 
> > If not, I wonder if there's risk of building up a ton of
> > potential page faults?
> 
> Yes, if the full address space is suddenly referenced.
> 
> > > As tasks mature and converge their working set, so does the
> > > sampling rate slow down to just a trickle, 256 MB per 32
> > > seconds of CPU time executed.
> > > 
> > > This, beyond being adaptive, also rate-limits rarely
> > > executing systems and does not over-sample on overloaded
> > > systems.
> > 
> > I am wondering if it would be better to shrink the scan period back to a
> > much smaller fixed value,
> 
> I'll do that anyway.
> 
> > and instead of picking 256MB ranges of memory
> > to mark completely, go back to using all of the address space, but mark
> > only every Nth page.
> 
> It'll still be necessary to do the full walk and I wonder if we'd lose on
> the larger number of PTE locks that will have to be taken to do a scan if
> we are only updating every 128 pages for example. It could be very expensive.
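Purely as an illustration of the mechanism being debated above (not the
actual kernel code), the rotating-offset scan boils down to something
like the toy user-space sketch below.  MM_SIZE, SCAN_SIZE and
mark_window() are invented for the example; only the 512GB and 256MB
figures come from the discussion above.

#include <stdio.h>

#define MM_SIZE   (512ULL << 30)   /* pretend 512GB guest, as above     */
#define SCAN_SIZE (256ULL << 20)   /* 256MB marked per scan period      */

/* Stand-in for whatever per-mm scan cursor the real code keeps. */
static unsigned long long scan_offset;

/* Stub: the real code would update protections on [start, end). */
static void mark_window(unsigned long long start, unsigned long long end)
{
	printf("marking %#llx - %#llx\n", start, end);
}

/* One scan period: mark the next window, then advance and wrap. */
static void scan_one_period(void)
{
	unsigned long long start = scan_offset;
	unsigned long long end = start + SCAN_SIZE;

	if (end >= MM_SIZE)
		end = MM_SIZE;

	mark_window(start, end);

	/* Start over at the first address once the end is reached. */
	scan_offset = (end >= MM_SIZE) ? 0 : end;
}

int main(void)
{
	for (int i = 0; i < 4; i++)
		scan_one_period();
	return 0;
}

Covering the whole 512GB this way takes 512GB / 256MB = 2048 periods; if
scan_period_min really is the 2 seconds mentioned above, that is roughly
68 minutes for a single pass, which is presumably where the "over an
hour" figure comes from.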
Yes, good point about the extra PTE lock traffic.  My other inclination
was not doing a mass marking of pages at all (except just one time at
some point after task init) and conditionally setting or clearing the
prot_numa in the fault path itself to control the fault rate.  The
problem I see is I am not sure how we "back-off" the fault rate per
page.  You could choose to not leave the page marked, but then you never
get a fault on that page again, so there's no good way to mark it again
in the fault path for that page unless you have the periodic marker.

However, maybe a certain number of pages are considered clustered
together, and a fault from any page is considered a fault for the
cluster of pages.  When handling the fault, the number of pages which
are marked in the cluster is varied to achieve a target, reasonable
fault rate.  Might be able to treat page migrations in clusters as
well...  I probably need to think about this a bit more....

> > N is adjusted each period to target a rolling
> > average of X faults per MB per execution time period. This per task N
> > would also be an interesting value to rank memory access frequency among
> > tasks and help prioritize scheduling decisions.
> 
> It's an interesting idea. I'll think on it more but my initial reaction
> is that the cost could be really high.

-Andrew Theurer
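P.S.  To make the clustering idea above a little more concrete, a rough
user-space sketch of that kind of controller is below: any fault in a
cluster counts for the whole cluster, and the number of pages left
marked is adjusted so the fault rate converges on a target.  All names
and numbers here (CLUSTER_PAGES, TARGET_FAULT_RATE, the halving/doubling
step) are invented for illustration and are not from any posted patch.

#include <stdio.h>

#define CLUSTER_PAGES     512      /* pages grouped into one cluster      */
#define TARGET_FAULT_RATE 100.0    /* desired NUMA faults/sec per cluster */

struct cluster {
	unsigned int nr_marked;    /* pages of the cluster left prot_numa */
	double fault_rate;         /* recent faults/sec seen here         */
};

/*
 * Hypothetical fault-path hook: any fault in the cluster counts as a
 * fault for the whole cluster, and the number of marked pages is nudged
 * up or down so the observed rate converges on the target.
 */
static void cluster_fault(struct cluster *c, double observed_rate)
{
	c->fault_rate = observed_rate;

	if (observed_rate > TARGET_FAULT_RATE && c->nr_marked > 1)
		c->nr_marked /= 2;          /* faulting too much: back off */
	else if (observed_rate < TARGET_FAULT_RATE &&
		 c->nr_marked < CLUSTER_PAGES)
		c->nr_marked *= 2;          /* too quiet: sample harder    */

	if (c->nr_marked > CLUSTER_PAGES)
		c->nr_marked = CLUSTER_PAGES;

	/* Re-mark nr_marked of the cluster's pages here (omitted). */
}

int main(void)
{
	struct cluster c = { .nr_marked = 64 };
	double rates[] = { 400.0, 250.0, 120.0, 60.0, 30.0 };

	for (unsigned int i = 0; i < sizeof(rates) / sizeof(rates[0]); i++) {
		cluster_fault(&c, rates[i]);
		printf("rate %.0f -> keep %u of %d pages marked\n",
		       rates[i], c.nr_marked, CLUSTER_PAGES);
	}
	return 0;
}

The per-cluster marking count (or a per-task aggregate of it) could then
double as the "N" mentioned above for ranking how frequently each task
touches its memory.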

