Hello, Well I did get to it before you:
http://review.source.kitware.com/#/c/6614/ I also uped the size of the image 100x in your test, here is the current performance on my system: System: victoria.nlm.nih.gov Processor: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz Serial #: Cache: 32768 Clock: 2794.27 Cores: 12 cpus x 24 Cores = 288 OSName: Mac OS X Release: 10.6.8 Version: 10K549 Platform: x86_64 Operating System is 64 bit ITK Version: 3.20.1 Virtual Memory: Total: 256 Available: 228 Physical Memory: Total:65536 Available: 58374 Probe Name: Count Min Mean Stdev Max Total MeanSquares_1_threads 20 0.344348 0.347567 0.00244733 0.352629 6.95134 MeanSquares_2_threads 20 0.251223 0.300869 0.0179305 0.321404 6.01738 MeanSquares_4_threads 20 0.215516 0.348677 0.173645 0.678274 6.97355 MeanSquares_8_threads 20 0.138184 0.182681 0.0297812 0.237129 3.65362 System: victoria.nlm.nih.gov Processor: Serial #: Cache: 32768 Clock: 2930 Cores: 12 cpus x 24 Cores = 288 OSName: Mac OS X Release: 10.6.8 Version: 10K549 Platform: x86_64 Operating System is 64 bit ITK Version: 4.2.0 Virtual Memory: Total: 256 Available: 228 Physical Memory: Total:65536 Available: 58371 Probe Name: Count Min Mean Stdev Max Total MeanSquares_1_threads 20 0.382481 0.383342 0.00186954 0.391027 7.66685 MeanSquares_2_threads 20 0.211908 0.335328 0.0777408 0.435574 6.70655 MeanSquares_4_threads 20 0.271531 0.315688 0.0390751 0.385683 6.31377 MeanSquares_8_threads 20 0.147544 0.192132 0.0299427 0.240976 3.84263 In the patch provided, it is implicitly done on assignment on a per-thread basis. What was most un-expected was when then allocation of the Jacobin was explicitly done out side the threaded part, the time when up by 50%! I presume that the sequential allocation, of the doubles in the master thread made the allocation sequentially, next to each other, and may be a more insidious form of false sharing. Below is the numbers from this run, notice the lack of speed up with more threads: System: victoria.nlm.nih.gov Processor: Serial #: Cache: 32768 Clock: 2930 Cores: 12 cpus x 24 Cores = 288 OSName: Mac OS X Release: 10.6.8 Version: 10K549 Platform: x86_64 Operating System is 64 bit ITK Version: 4.2.0 Virtual Memory: Total: 256 Available: 226 Physical Memory: Total:65536 Available: 57091 Probe Name: Count Min Mean Stdev Max Total MeanSquares_1_threads 20 0.403931 0.40648 0.00213043 0.41389 8.1296 MeanSquares_2_threads 20 0.243789 0.367603 0.0894637 0.65006 7.35206 MeanSquares_4_threads 20 0.281336 0.354749 0.0431082 0.440161 7.09497 MeanSquares_8_threads 20 0.24615 0.301576 0.0552998 0.446528 6.03151 Brad On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote: > Brad, > > The false sharing issue is a good point - however, i dont think this is the > cause of the performance degradation. This part of the class (m_Threader, > etc) has not changed since 3.20. (I used the optimized metrics in my 3.20 > builds, so its in Review/itkOptMeanSquares....) It also does not explain the > performance drop in single threaded mode. > > Testing will tell... Seems like a Friday afternoon project to me, unless > someone else gets there first. > > Rupert > > -------------------------------------------------------------- > Rupert Brooks > [email protected] > > > > On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <[email protected]> > wrote: > Hello, > > Continuing to glance at the class.... I also see the following member > variables for the MeanSquares class: > > MeasureType * m_ThreaderMSE; > DerivativeType *m_ThreaderMSEDerivatives; > > Where these are index by the thread ID and access simultaneously across the > threads causes the potential for False Sharing, which can be a MAJOR problem > with threaded algorithms. > > I would think a good solution would be to create a per-thread data structure > consisting of the Jacobin, MeasureType, and DerivativeType, plus padding to > prevent false sharing, or equivalently assigning max data alignment to the > structure. > > Rupert, Would like to take a stab at this fix? > > Brad > > > On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote: > >> Sorry if this repeats - i just got a bounce from Insight Developers, so im >> trimming the message and resending.... >> -------------------------------------------------------------- >> Rupert Brooks >> [email protected] >> >> >> >> On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <[email protected]> >> wrote: >> Aha. Heres around line 183 of itkTranslationTransform. >> >> // Compute the Jacobian in one position >> template <class TScalarType, unsigned int NDimensions> >> void >> TranslationTransform<TScalarType, >> NDimensions>::ComputeJacobianWithRespectToParameters( >> const InputPointType &, >> JacobianType & jacobian) const >> { >> // the Jacobian is constant for this transform, and it has already been >> // initialized in the constructor, so we just need to return it here. >> jacobian = this->m_IdentityJacobian; >> return; >> } >> >> Thats probably the culprit, although the root cause may be the reallocating >> of the jacobian every time through the loop. >> >> Rupert >> >> <snipped> > > ======================================================== Bradley Lowekamp Medical Science and Computing for Office of High Performance Computing and Communications National Library of Medicine [email protected]
_______________________________________________ Powered by www.kitware.com Visit other Kitware open-source projects at http://www.kitware.com/opensource/opensource.html Kitware offers ITK Training Courses, for more information visit: http://kitware.com/products/protraining.php Please keep messages on-topic and check the ITK FAQ at: http://www.itk.org/Wiki/ITK_FAQ Follow this link to subscribe/unsubscribe: http://www.itk.org/mailman/listinfo/insight-developers
