Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity

Ingo Molnar Tue, 23 Jun 2015 01:11:45 -0700

* Srikar Dronamraju <[email protected]> wrote:

> * Rik van Riel <[email protected]> [2015-06-16 10:39:13]:
> 
> > On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> > > This is consistent with all other load balancing instances where we
> > > absorb unfairness up to env->imbalance_pct. Absorbing unfairness up to
> > > env->imbalance_pct allows tasks to be pulled to, and retained on, their
> > > preferred nodes.
> > >
> > > Signed-off-by: Srikar Dronamraju <[email protected]>
> >
> > How does this work with other workloads, eg.
> > single instance SPECjbb2005, or two SPECjbb2005
> > instances on a four node system?
> >
> > Is the load still balanced evenly between nodes
> > with this patch?
> >
> 
> Yes, I have looked at mpstat logs while running SPECjbb2005 for 1 JVM per
> System, 2 JVMs per System and 4 JVMs per System and observed that the
> load spreading was similar with and without this patch.
> 
> Also I have visualized using htop when running 0.5X (i.e. 48 threads on
> a 96-CPU system) CPU stress workloads to see that the spread is similar
> before and after the patch.
> 
> Please let me know if there are any better ways to observe the
> spread. [...]


There are. I see you are using prehistoric tooling, but see the various NUMA 
convergence latency measurement utilities in 'perf bench numa':

triton:~/tip> perf bench numa mem -h
# Running 'numa/mem' benchmark:

 # Running main, "perf bench numa numa-mem -h"

 usage: perf bench numa <options>

    -p, --nr_proc <n>     number of processes
    -t, --nr_threads <n>  number of threads per process
    -G, --mb_global <MB>  global  memory (MBs)
    -P, --mb_proc <MB>    process memory (MBs)
    -L, --mb_proc_locked <MB>
                          process serialized/locked memory access (MBs), <= process_memory
    -T, --mb_thread <MB>  thread  memory (MBs)
    -l, --nr_loops <n>    max number of loops to run
    -s, --nr_secs <n>     max number of seconds to run
    -u, --usleep <n>      usecs to sleep per loop iteration
    -R, --data_reads      access the data via writes (can be mixed with -W)
    -W, --data_writes     access the data via writes (can be mixed with -R)
    -B, --data_backwards  access the data backwards as well
    -Z, --data_zero_memset
                          access the data via glibc bzero only
    -r, --data_rand_walk  access the data with random (32bit LFSR) walk
    -z, --init_zero       bzero the initial allocations
    -I, --init_random     randomize the contents of the initial allocations
    -0, --init_cpu0       do the initial allocations on CPU#0
    -x, --perturb_secs <n>
                          perturb thread 0/0 every X secs, to test convergence stability
    -d, --show_details    Show details
    -a, --all             Run all tests in the suite
    -H, --thp <n>         MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE
    -c, --show_convergence
                          show convergence details
    -m, --measure_convergence
                          measure convergence latency
    -q, --quiet           quiet mode
    -S, --serialize-startup
                          serialize thread startup
    -C, --cpus <cpu[,cpu2,...cpuN]>
                          bind the first N tasks to these specific cpus (the rest is unbound)
    -M, --memnodes <node[,node2,...nodeN]>
                          bind the first N tasks to these specific memory nodes (the rest is unbound)

'-m' will measure convergence.
'-c' will visualize it.
'--thp' can be used to turn hugepages on/off
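
As a rough sketch of how these three options might be combined: the helper
name numa_converge_cmd and the process/thread/memory/time values below are
placeholders picked for illustration, not tuned settings. The function only
prints the command line, so it can be inspected before being run:

```shell
#!/bin/bash
# Hypothetical helper: print a 'perf bench numa mem' command line that
# measures (-m) and shows (-c) convergence, with THP switched on or off.
# All flags come from the help text above; the numeric values are
# arbitrary placeholders.
numa_converge_cmd() {
  # Per the help text: a value < 0 selects MADV_NOHUGEPAGE,
  # a value > 0 selects MADV_HUGEPAGE. Default to hugepages on.
  thp="${1:-1}"
  printf 'perf bench numa mem -p 2 -t 16 -P 1024 -s 60 -m -c --thp=%s\n' "$thp"
}

# Print the command for the hugepages-on case.
numa_converge_cmd 1
```

Calling 'numa_converge_cmd -1' prints the same command with hugepages turned
off, which makes it easy to compare convergence with and without THP.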

For example you can create a 'numa02' work-alike by doing:

  vega:~> cat numa02
  #!/bin/bash

  perf bench numa mem --no-data_rand_walk -p 1 -t 32 -G 0 -P 0 -T 32 -l 800 -zZ0c $@

This perf bench numa command mimics numa02 pretty exactly on a 32 CPU system.

This will run it in a loop:

  vega:~> cat numa02-loop 

  while :; do
    ./numa02 2>&1 | grep runtime-max/thread
    sleep 1
  done

Or here are various numa01 work-alikes:

  vega:~> cat numa01
  perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 3072 -T 0 -l 50 -zZ0c $@

  vega:~> cat numa01-hard-bind
  ./numa01 --cpus=0-16_16x16#16 --memnodes=0x16,2x16

or numa01-thread-alloc:

  vega:~> cat numa01-THREAD_ALLOC

  perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 0 -T 192 -l 1000 -zZ0c $@

You can generate very flexible setups of NUMA access patterns, and measure their
behavior accurately.

It's all so much more capable and more flexible than autonumabench ...

Also, when you are trying to report numbers for multiple runs, please use 
something like:

   perf stat --null --repeat 3 ...

This will run the workload 3 times (doing only time measurement) and report the 
stddev in a human readable form.
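
A minimal sketch of wrapping that in a script, assuming the numa02 script
from above sits in the current directory. The names repeat_timed and DRY_RUN
are made up for this example; only '--null' and '--repeat' are real perf
stat options:

```shell
#!/bin/bash
# Hypothetical wrapper: run a workload N times under
# 'perf stat --null --repeat N'. With DRY_RUN=1 it only prints the
# command line instead of executing it.
repeat_timed() {
  runs="$1"
  shift
  # Rebuild the positional parameters as the full perf stat command.
  set -- perf stat --null --repeat "$runs" "$@"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example (not executed here):
#   repeat_timed 3 ./numa02
```

Running 'DRY_RUN=1 repeat_timed 3 ./numa02' first shows the exact command
that would be executed.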

Thanks,

        Ingo
