Sorry for not being clear! I got too excited by finding the solution to the 
performance issue with ITKv3 registration in ITKv4.

The first is vanilla 3.20, the second is 4.2 + the gerrit patch. The third is 
the gerrit patch with the pre-allocation of the Jacobian outside the threaded 
section! Vanilla 4.2 is ~2X 3.20 for this test on my system too.

Summary for the MeanSquares metric in your test:

3.20: 1X
4.2: ~2X
4.2 + gerrit patch: 1X
4.2 + gerrit patch + single-threaded preallocation of Jacobian: 1.5X

On Jul 26, 2012, at 12:48 PM, Rupert Brooks wrote:

> Brad,
> 
> Am I reading this right? The second set of numbers is vanilla ITK 4.2 and 
> the third set is with your patch?

No.

> 
> a) Single threaded: allocating outside the threaded part made the mean time 
> go up from .38 to .4, and it seems to be a real effect.

That is not a significant change compared to the effect of CPU temperature and 
other outside influences.

> b) After allocation outside the thread, the multithread performance is worse 
> both absolutely, and in the serial/parallel percentage.
> 
> Weird.
> 
> I also note that going from 4 to 8 threads seems to get you a greater ratio 
> of improvement than going from 1 to 2, or 2 to 4, especially with the 
> vanilla ITK 4.2. Generally parallel improvement drops off with the number of 
> threads, because you are only subdividing the parallel part of the algorithm. 
> I have found in the past that this sort of weirdness can occur on NUMA 
> systems depending on whether the OS puts multiple threads on the same 
> physical chip or not.

That effect is not on vanilla 4.2; it's on the gerrit-patched build. I agree 
that on multi-socket, multi-core systems with hyper-threading, oddities due to 
the location of caches frequently occur.

> 
> Are you really fortunate enough to have 288 cores in that Mac, or is it more 
> like 3 chips at 8 cores each or something?

I have a dual Xeon Mac Pro. It has 2 physical CPUs, each with 6 cores, which 
can be hyper-threaded.

> 
> My personal tendency would be to try to understand why the single thread 
> performance got so much worse, then (if relevant) try and fix the threading.  
> ITK 3.20 doesn't do very well with increasing the number of threads either.

The single-threaded performance is worse because there is a malloc occurring 
for every sample. My patch addressed that.

> 
> Just to reiterate where I'm coming from: I want to show that, accuracy- and 
> performance-wise, ITK4 is at least no worse than ITK3. That justifies 
> upgrading. In the longer run, it may make more sense to rewrite old ITK 
> stuff to use the new metrics, but that is a separate project.

My intention was to demonstrate that with the patch in gerrit the performance 
is about the same as ITKv3.

Brad

> 
> Rupert
> --------------------------------------------------------------
> Rupert Brooks
> [email protected]
> 
> 
> 
> On Thu, Jul 26, 2012 at 11:53 AM, Bradley Lowekamp <[email protected]> 
> wrote:
> Hello,
> 
> Well I did get to it before you:
> 
> http://review.source.kitware.com/#/c/6614/
> 
> I also upped the size of the image 100x in your test; here is the current 
> performance on my system:
> 
> System: victoria.nlm.nih.gov
> Processor: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz
>  Serial #: 
>     Cache: 32768
>     Clock: 2794.27
>     Cores: 12 cpus x 24 Cores = 288
> OSName:     Mac OS X
>   Release:  10.6.8
>   Version:  10K549
>   Platform: x86_64
>   Operating System is 64 bit
> ITK Version: 3.20.1
> Virtual Memory: Total: 256 Available: 228
> Physical Memory: Total:65536 Available: 58374
>            Probe Name:  Count       Min      Mean       Stdev       Max     Total
>  MeanSquares_1_threads     20  0.344348  0.347567  0.00244733  0.352629   6.95134
>  MeanSquares_2_threads     20  0.251223  0.300869   0.0179305  0.321404   6.01738
>  MeanSquares_4_threads     20  0.215516  0.348677    0.173645  0.678274   6.97355
>  MeanSquares_8_threads     20  0.138184  0.182681   0.0297812  0.237129   3.65362
> System: victoria.nlm.nih.gov
> Processor: 
>  Serial #: 
>     Cache: 32768
>     Clock: 2930
>     Cores: 12 cpus x 24 Cores = 288
> OSName:     Mac OS X
>   Release:  10.6.8
>   Version:  10K549
>   Platform: x86_64
>   Operating System is 64 bit
> ITK Version: 4.2.0
> Virtual Memory: Total: 256 Available: 228
> Physical Memory: Total:65536 Available: 58371
>            Probe Name:  Count       Min      Mean       Stdev       Max     Total
>  MeanSquares_1_threads     20  0.382481  0.383342  0.00186954  0.391027   7.66685
>  MeanSquares_2_threads     20  0.211908  0.335328   0.0777408  0.435574   6.70655
>  MeanSquares_4_threads     20  0.271531  0.315688   0.0390751  0.385683   6.31377
>  MeanSquares_8_threads     20  0.147544  0.192132   0.0299427  0.240976   3.84263
> 
> 
> In the patch provided, the allocation is implicitly done on assignment, on a 
> per-thread basis. What was most unexpected was that when the allocation of 
> the Jacobian was explicitly done outside the threaded part, the time went up 
> by 50%! I presume that sequentially allocating the doubles in the master 
> thread placed the buffers next to each other, which may be a more insidious 
> form of false sharing. Below are the numbers from this run; notice the lack 
> of speedup with more threads:
> 
> System: victoria.nlm.nih.gov
> Processor: 
>  Serial #: 
>     Cache: 32768
>     Clock: 2930
>     Cores: 12 cpus x 24 Cores = 288
> OSName:     Mac OS X
>   Release:  10.6.8
>   Version:  10K549
>   Platform: x86_64
>   Operating System is 64 bit
> ITK Version: 4.2.0
> Virtual Memory: Total: 256 Available: 226
> Physical Memory: Total:65536 Available: 57091
>            Probe Name:  Count       Min      Mean       Stdev       Max     Total
>  MeanSquares_1_threads     20  0.403931   0.40648  0.00213043   0.41389    8.1296
>  MeanSquares_2_threads     20  0.243789  0.367603   0.0894637   0.65006   7.35206
>  MeanSquares_4_threads     20  0.281336  0.354749   0.0431082  0.440161   7.09497
>  MeanSquares_8_threads     20   0.24615  0.301576   0.0552998  0.446528   6.03151
> 
> 
> Brad
> 
> 
> On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote:
> 
>> Brad,
>> 
>> The false sharing issue is a good point; however, I don't think this is the 
>> cause of the performance degradation. This part of the class (m_Threader, 
>> etc.) has not changed since 3.20. (I used the optimized metrics in my 3.20 
>> builds, so it's in Review/itkOptMeanSquares....) It also does not explain 
>> the performance drop in single-threaded mode.
>> 
>> Testing will tell...  Seems like a Friday afternoon project to me, unless 
>> someone else gets there first.
>> 
>> Rupert
>> 
>> --------------------------------------------------------------
>> Rupert Brooks
>> [email protected]
>> 
>> 
>> 
>> On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <[email protected]> 
>> wrote:
>> Hello,
>> 
>> Continuing to glance at the class.... I also see the following member 
>> variables for the MeanSquares class:
>> 
>>   MeasureType *   m_ThreaderMSE;
>>   DerivativeType *m_ThreaderMSEDerivatives;
>> 
>> These are indexed by the thread ID and accessed simultaneously across the 
>> threads, which creates the potential for false sharing, and that can be a 
>> MAJOR problem with threaded algorithms.
>> 
>> I would think a good solution would be to create a per-thread data 
>> structure consisting of the Jacobian, MeasureType, and DerivativeType, plus 
>> padding to prevent false sharing, or equivalently assigning maximum data 
>> alignment to the structure.
>> 
>> Rupert, would you like to take a stab at this fix?
>> 
>> Brad
>> 
>> 
>> On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote:
>> 
>>> Sorry if this repeats; I just got a bounce from Insight Developers, so I'm 
>>> trimming the message and resending....
>>> --------------------------------------------------------------
>>> Rupert Brooks
>>> [email protected]
>>> 
>>> 
>>> 
>>> On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <[email protected]> 
>>> wrote:
>>> Aha. Here's around line 183 of itkTranslationTransform:
>>> 
>>> // Compute the Jacobian in one position
>>> template <class TScalarType, unsigned int NDimensions>
>>> void
>>> TranslationTransform<TScalarType, 
>>> NDimensions>::ComputeJacobianWithRespectToParameters(
>>>   const InputPointType &,
>>>   JacobianType & jacobian) const
>>> {
>>>   // the Jacobian is constant for this transform, and it has already been
>>>   // initialized in the constructor, so we just need to return it here.
>>>   jacobian = this->m_IdentityJacobian;
>>>   return;
>>> }
>>> 
>>> That's probably the culprit, although the root cause may be the 
>>> reallocation of the Jacobian every time through the loop.
>>> 
>>> Rupert
>>> 
>>> <snipped>
>> 
>> 
> 
> ========================================================
> Bradley Lowekamp  
> Medical Science and Computing for
> Office of High Performance Computing and Communications
> National Library of Medicine 
> [email protected]
> 
> 
> 
> 

========================================================
Bradley Lowekamp  
Medical Science and Computing for
Office of High Performance Computing and Communications
National Library of Medicine 
[email protected]


