On Thu, Mar 9, 2017 at 3:01 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Wed, 8 Mar 2017, David Carrillo-Cisneros wrote:
>> On Wed, Mar 8, 2017 at 12:30 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>> > Same applies for per CPU measurements.
>>
>> For CPU measurements, we need perf-like CPU filtering to support tools
>> that perform low-overhead monitoring by polling CPU events. These
>> tools approximate per-cgroup/task events by reconciling CPU events
>> with logs of what job ran when on which CPU.
>
> Sorry, but for CQM that's just voodoo analysis.
I'll argue against that. In any case, perf-like CPU monitoring is also
needed for MBM, a less contentious scenario, I believe.

> CPU default is CAT group 0 (20% of cache)
> T1 belongs to CAT group 1 (40% of cache)
> T2 belongs to CAT group 2 (40% of cache)
>
> Now you do low overhead samples of the CPU (all groups accounted) with 1
> second period.
>
> Lets assume that T1 runs 50% and T2 runs 20% the rest of the time is
> utilized by random other things and the kernel itself (using CAT group 0).
>
> What is the accumulated value telling you?

In this single example, not much: only the sum of occupancies. But assume
I have T1...T10000 different jobs, and I randomly select a pair of those
jobs to run together on a machine (they become the T1 and T2 in your
example). Then I repeat that hundreds of thousands of times. I can collect
all the data as (tasks run, time run, occupancy) tuples and fit a simple
regression to estimate each job's expected occupancy (with a confidence
interval). That approximation is inaccurate, but it is very useful to feed
into a job scheduler. Furthermore, it can be correlated with the values of
other events that are currently sampled this way.

> How do you approximate that back to T1/T2 and the rest?

As described above: large numbers of random samples. More sophisticated
(voodoo?) statistical techniques are employed in practice to account for
almost all the issues I could think of (selection bias, missing values,
interaction between tasks, etc.). They seem to work fine.

> How do you do that when the tasks are switching between the samples
> several times?

It does not work well for a single run (your example). But for the example
I gave, one can simply rely on random sampling, the Law of Large Numbers,
and the Central Limit Theorem.

Thanks,
David
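P.S. A minimal sketch of the idea above, with simulated data and made-up
numbers (40 tasks, 40k samples, Gaussian noise are all illustrative
assumptions; it uses a simple moment-based estimate standing in for the
regression): each observation is only the noisy *sum* of two co-scheduled
tasks' occupancies, yet per-task values are recoverable.

```python
# Hypothetical sketch: recover per-task cache occupancy from aggregate
# per-CPU samples of randomly co-scheduled task pairs (simulated data).
import random

random.seed(0)

n_tasks = 40          # stand-in for the T1..T10000 job pool
n_samples = 40000     # (task pair, total occupancy) observations

# Ground-truth per-task occupancy -- unknown to the monitoring tool.
true_occ = [random.uniform(1.0, 10.0) for _ in range(n_tasks)]

# Each sample: two distinct tasks co-run; we observe only the noisy sum
# of their occupancies (the aggregated CPU-level reading).
totals = [0.0] * n_tasks
counts = [0] * n_tasks
grand_sum = 0.0
for _ in range(n_samples):
    a, b = random.sample(range(n_tasks), 2)
    y = true_occ[a] + true_occ[b] + random.gauss(0, 0.5)
    for t in (a, b):
        totals[t] += y
        counts[t] += 1
    grand_sum += y

# Moment equations (partner is uniform over the other n-1 tasks):
#   E[avg_t]  = occ_t * (n-2)/(n-1) + S/(n-1),  where S = sum of all occ
#   E[mean_y] = 2 * S / n
S_hat = n_tasks * (grand_sum / n_samples) / 2.0
est = [((totals[t] / counts[t]) - S_hat / (n_tasks - 1))
       * (n_tasks - 1) / (n_tasks - 2) for t in range(n_tasks)]

err = max(abs(e - o) for e, o in zip(est, true_occ))
print(round(err, 3))  # small recovery error despite only seeing sums
```

With enough random pairings every task appears often enough that the
partner contribution averages out, which is exactly the Law of Large
Numbers argument above; a real tool would use a regression with more
covariates, but the recovery principle is the same.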