Hello Chuck,

Thanks for running and posting those performance numbers. Sadly, it seems that 
1:1 is most often the most efficient use of CPU cycles.
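For what it's worth, that reading can be checked mechanically from the numbers you posted; a quick sketch (shell + awk, values copied from your non-NUMA-bound runs):

```shell
#!/bin/sh
# Total CPU time (user + sys) per configuration, and the ratio relative
# to the 1:1 (20-thread) run -- values taken from the posted timings.
awk 'BEGIN {
  user[20]  = 19866.685; sys[20]  = 429.796
  user[40]  = 26953.663; sys[40]  = 556.960
  user[80]  = 40970.439; sys[80]  = 822.944
  user[160] = 62542.901; sys[160] = 1289.864
  base = user[20] + sys[20]
  for (t = 20; t <= 160; t *= 2) {
    total = user[t] + sys[t]
    printf "%3d threads: %9.1f CPU-s (%.2fx the 1:1 run)\n", t, total, total / base
  }
}'
```

So 8:1 burns roughly three times the cycles of 1:1 for the same build, even though its wall time is shortest.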

It's interesting to see how this architecture scales with a large number of 
processes, given that each core seems designed for 8 lighter-weight threads.

I was hoping to run a similar performance test on lhcp-rh6, which has 80 virtual 
cores: 4 sockets, each with 10 cores plus hyper-threading. Unfortunately, I need 
to get more familiar with ninja, as my timing results appear to come from cached 
compilations rather than from actually running the compiler.
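Once I sort that out, the plan is roughly the following sweep (a sketch only, printed as a dry run; the physical core count and -j values assume the 4 x 10-core layout above):

```shell
#!/bin/sh
# Dry run: print the build commands for 1:1, 2:1, and 4:1 jobs-per-
# physical-core ratios on an assumed 40-physical-core (80 virtual) box.
# 'ninja clean' before each timed build avoids measuring cached results.
PHYS_CORES=40
for RATIO in 1 2 4; do
  echo "ninja clean && time ninja -j $((PHYS_CORES * RATIO))"
done
```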

I hope we are able to improve ITK threading performance with this system. But 
given how finicky this type of performance tuning is, and without direct access 
to the system to easily run performance analysis, I am a little unclear on how 
best to utilize it.
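Concretely, I imagine the binding on that box would look something like this (the core numbering is an assumption about the lhcp-rh6 topology; `numactl -H` would confirm it):

```shell
#!/bin/sh
# Bind a 20-job build to every other physical core across 4 NUMA domains.
# Assumed layout: domain D owns cores 10*D .. 10*D+9; SMT siblings 40-79.
DOMAINS=4
CORES_PER_DOMAIN=10
CPUS=""
for D in 0 1 2 3; do
  for C in 0 2 4 6 8; do        # every other core within the domain
    CPUS="${CPUS}${CPUS:+,}$((D * CORES_PER_DOMAIN + C))"
  done
done
echo "numactl --physcpubind=${CPUS} ninja -j 20"
```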

Thanks!
Brad

On Apr 23, 2015, at 6:21 PM, Chuck Atkins <[email protected]> wrote:

> In case anybody's interested, here's the "spread_numa.sh" script I use to 
> evenly distribute across NUMA domains and bind to CPU cores:
> 
> ----------BEGIN spread_numa.sh----------
> #!/bin/bash
> 
> # Evenly spread a command across numa domains for a given number of CPU cores
> function spread()
> {
>   NUM_CORES=$1
>   shift
> 
>   # Use this wicked awk script to parse the numactl hardware layout and
>   # select an equal number of cores from each NUMA domain, evenly spaced
>   # across each domain
>   SPREAD="$(numactl -H | sed -n 's|.*cpus: \(.*\)|\1|p' |
>     awk -v NC=${NUM_CORES} -v ND=${NUMA_DOMAINS} '
>       BEGIN{CPD=NC/ND}
>       {S=NF/CPD;
>        for(C=0;C<CPD;C++){F0=C*S; F1=(F0==int(F0)?F0:int(F0)+1)+1;
>          printf("%d", $F1); if(!(NR==ND && C==CPD-1)){printf(",")}}}')"
> 
>   echo Executing: numactl --physcpubind=${SPREAD} "$@"
>   numactl --physcpubind=${SPREAD} "$@"
> }
> 
> # Check command arguments
> if [ $# -lt 2 ]
> then
>   echo "Usage: $0 [NUM_CORES_TO_USE] [cmd [arg1] ... [argn]]"
>   exit 1
> fi
> 
> # Determine the total number of CPU cores
> MAX_CORES=$(numactl -s | sed -n 's|physcpubind: \(.*\)|\1|p' | wc -w)
> 
> # Determine the total number of NUMA domains
> NUMA_DOMAINS=$(numactl -H | sed -n 's|available: \([0-9]*\).*|\1|p')
> 
> # Verify the number of cores is sane
> NUM_CORES=$1
> shift
> if [ $NUM_CORES -gt $MAX_CORES ]
> then
>   echo "WARNING: $NUM_CORES cores is out of bounds.  Setting to $MAX_CORES cores."
>   NUM_CORES=$MAX_CORES
> fi
> if [ $((NUM_CORES%NUMA_DOMAINS)) -ne 0 ]
> then
>   TMP=$(( ((NUM_CORES/NUMA_DOMAINS) + 1) * NUMA_DOMAINS ))
>   echo "WARNING: $NUM_CORES core(s) are not evenly divided across $NUMA_DOMAINS NUMA domains.  Setting to $TMP."
>   NUM_CORES=$TMP
> fi
> 
> echo "Using ${NUM_CORES}/${MAX_CORES} cores across ${NUMA_DOMAINS} NUMA domains"
> 
> spread ${NUM_CORES} "$@"
> ----------END spread_numa.sh----------
> 
> 
> - Chuck
> 
> On Thu, Apr 23, 2015 at 4:57 PM, Chuck Atkins <[email protected]> 
> wrote:
> (re-sent for the rest of the dev list)
> Hi Bradley,
> 
> It's pretty fast. The interesting numbers are for 20, 40, 80, and 160.  That 
> aligns with 1:1, 2:1, 4:1, and 8:1 threads to core ratio.  Starting from the 
> already configured ITKLinuxPOWER8 currently being built, I did a ninja clean 
> and then "time ninja -jN".  Watching the CPU load for 20, 40, and 80 cores 
> though, I see a fair amount of both process migration and unbalanced thread 
> distribution, i.e. for -j20 I'll often see 2 cores with 6 or 8 threads and 
> the rest with only 1 or 2.  So in addition to the -jN settings, I also ran 
> 20, 40, and 80 threads using numactl with fixed binding to physical CPU cores 
> to evenly distribute the threads across cores and prevent thread migration.  
> See timings below in seconds:
> 
> Threads          Real      User       Sys       Total CPU Time
> 20               1037.097  19866.685  429.796   20296.481
> 20 (NUMA bind)   915.910   16290.589  319.017   16609.606
> 40               713.772   26953.663  556.960   27510.623
> 40 (NUMA bind)   641.924   22442.685  432.379   22875.064
> 80               588.357   40970.439  822.944   41793.383
> 80 (NUMA bind)   538.801   35366.297  637.922   36004.219
> 160              572.492   62542.901  1289.864  63832.765
> 160 (NUMA bind)  549.742   61864.666  1242.975  63107.641
> 
> 
> 
> So it seems like core binding gives us roughly a 10% performance increase 
> across all thread configurations.  And while the core-bound 4:1 clearly gave 
> us the best wall time, looking at the total CPU time (user + sys), the 1:1 
> configuration looks to be the most efficient in actual cycles used.
> 
> It's interesting to watch how the whole system gets used for most of the 
> build, but everything periodically gates on a handful of linker processes.  
> And of course, it's always cool to see a screen cap of htop with a whole 
> boatload of cores at 100%.
> 
> 
> - Chuck
> 
> On Thu, Apr 23, 2015 at 10:01 AM, Bradley Lowekamp <[email protected]> 
> wrote:
> Matt,
> 
> I'd love to explore the build performance of this system.
> 
> Any chance you could run clean builds of ITK on this system with 
> 20, 40, 60, 80, 100, 120, 140, and 160 processes and record the timings?
> 
> I am very curious how this unique system scales with multiple heavyweight 
> processes, as its design appears to be uniquely suited to lighter-weight 
> multi-threading.
> 
> Thanks,
> Brad
> 
> On Apr 22, 2015, at 11:51 PM, Matt McCormick <[email protected]> 
> wrote:
> 
> > Hi folks,
> >
> > With thanks to Chuck Atkins and FSF France, we have a new build on the
> > dashboard [1] for the IBM POWER8 [2] system.  This is a PowerPC64
> > system with 20 cores and 8 threads per core -- a great system where we
> > can test and improve ITK parallel computing performance!
> >
> >
> > To generate a test build on Gerrit, add
> >
> >  request build: power8
> >
> > in a review's comments.
> >
> >
> > There are currently some build warnings and test failures that should
> > be addressed before we will be able to use the system effectively. Any
> > help here is appreciated.
> >
> > Thanks,
> > Matt
> >
> >
> > [1] 
> > https://open.cdash.org/index.php?project=Insight&date=2015-04-22&filtercount=1&showfilters=1&field1=site/string&compare1=63&value1=gcc112
> >
> > [2] https://en.wikipedia.org/wiki/POWER8
> > _______________________________________________
> > Powered by www.kitware.com
> >
> > Visit other Kitware open-source projects at
> > http://www.kitware.com/opensource/opensource.html
> >
> > Kitware offers ITK Training Courses, for more information visit:
> > http://kitware.com/products/protraining.php
> >
> > Please keep messages on-topic and check the ITK FAQ at:
> > http://www.itk.org/Wiki/ITK_FAQ
> >
> > Follow this link to subscribe/unsubscribe:
> > http://public.kitware.com/mailman/listinfo/insight-developers
> > _______________________________________________
> > Community mailing list
> > [email protected]
> > http://public.kitware.com/mailman/listinfo/community
> 
> 
> 
