Sorry for not being clear! I got too excited by finding the solution to the performance issue with ITKv3 registration in ITKv4.
This first is vanilla 3.20, the second is 4.20+ the gerrit patch. The third is the gerrit patch with the pre-malloc of the Jacobin outside the threaded section! Vanilla 4.2 is ~2x 3.20 for this test on my system too. Summary for the MeansSquares metric in your test: 3.20: 1X 4.2: 2+X 4.2+gerrit patch: 1X 4.2+gerrit patch + single-threaded preallocation of jacobian: 1.5X On Jul 26, 2012, at 12:48 PM, Rupert Brooks wrote: > Brad, > > Am i reading this right? - the second set of numbers is vanilla ITK 4.2 and > the third set is with your patch? no. > > a) Single threaded, allocating outside the threaded part made the mean time > go up from .38 to .4 in the single threaded case, and it seems to be a real > effect. That is not a significant change, compare the to effect of CPU temperature and other outside influences. > b) After allocation outside the thread, the multithread performance is worse > both absolutely, and in the serial/parallel percentage. > > Wierd. > > I also note that going from 4 to 8 threads seems to get you a greater ratio > of improvement than going from 1 to 2, or 2 to 4. Especially with the > vanilla itk 4.2. Generally parallel improvement drops off with number of > threads, because you are only subdividing the parallel part of the algorithm. > I have found in the past that this sort of wierdness can occur with NUMA > systems depending whether the OS puts multiple threads on the same physical > chip or not. That effect is not on vanilla 4.2 it's on the gerrit patched, I agree that with multi-socket, multi-core with hyper-threading oddities to to the location of caches frequently occur. > > Are you really fortunate enough to have 288 cores in that Mac, or is it more > like 3 chips at 8 cores each or something? I have a dual Xeon mac pro. It has 2 Physical cores, each with 6 cores, which can be hyper-threaded. > > My personal tendency would be to try to understand why the single thread > performance got so much worse, then (if relevant) try and fix the threading. > ITK 3.20 doesnt do very well with increasing the numbers of threads either. The single threaded performance is worse because there is a malloc occurring for every sample. My patch addressed that. > > Just to reiterate where im coming from - i want to show that accuracy and > performance wise ITK4 is at least no worse than ITK3. That justifies > upgrading. In the longer run, it may make more sense to rewrite old ITK > stuff to use the new metrics, but that is a separate project. My intention was to demonstrate that with the patch in gerrit the performance is about the same as ITKv3. Brad > > Rupert > -------------------------------------------------------------- > Rupert Brooks > [email protected] > > > > On Thu, Jul 26, 2012 at 11:53 AM, Bradley Lowekamp <[email protected]> > wrote: > Hello, > > Well I did get to it before you: > > http://review.source.kitware.com/#/c/6614/ > > I also uped the size of the image 100x in your test, here is the current > performance on my system: > > System: victoria.nlm.nih.gov > Processor: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz > Serial #: > Cache: 32768 > Clock: 2794.27 > Cores: 12 cpus x 24 Cores = 288 > OSName: Mac OS X > Release: 10.6.8 > Version: 10K549 > Platform: x86_64 > Operating System is 64 bit > ITK Version: 3.20.1 > Virtual Memory: Total: 256 Available: 228 > Physical Memory: Total:65536 Available: 58374 > Probe Name: Count Min Mean Stdev > Max Total > MeanSquares_1_threads 20 0.344348 0.347567 > 0.00244733 0.352629 6.95134 > MeanSquares_2_threads 20 0.251223 0.300869 > 0.0179305 0.321404 6.01738 > MeanSquares_4_threads 20 0.215516 0.348677 > 0.173645 0.678274 6.97355 > MeanSquares_8_threads 20 0.138184 0.182681 > 0.0297812 0.237129 3.65362 > System: victoria.nlm.nih.gov > Processor: > Serial #: > Cache: 32768 > Clock: 2930 > Cores: 12 cpus x 24 Cores = 288 > OSName: Mac OS X > Release: 10.6.8 > Version: 10K549 > Platform: x86_64 > Operating System is 64 bit > ITK Version: 4.2.0 > Virtual Memory: Total: 256 Available: 228 > Physical Memory: Total:65536 Available: 58371 > Probe Name: Count Min Mean Stdev > Max Total > MeanSquares_1_threads 20 0.382481 0.383342 > 0.00186954 0.391027 7.66685 > MeanSquares_2_threads 20 0.211908 0.335328 > 0.0777408 0.435574 6.70655 > MeanSquares_4_threads 20 0.271531 0.315688 > 0.0390751 0.385683 6.31377 > MeanSquares_8_threads 20 0.147544 0.192132 > 0.0299427 0.240976 3.84263 > > > In the patch provided, it is implicitly done on assignment on a per-thread > basis. What was most un-expected was when then allocation of the Jacobin was > explicitly done out side the threaded part, the time when up by 50%! I > presume that the sequential allocation, of the doubles in the master thread > made the allocation sequentially, next to each other, and may be a more > insidious form of false sharing. Below is the numbers from this run, notice > the lack of speed up with more threads: > > System: victoria.nlm.nih.gov > Processor: > Serial #: > Cache: 32768 > Clock: 2930 > Cores: 12 cpus x 24 Cores = 288 > OSName: Mac OS X > Release: 10.6.8 > Version: 10K549 > Platform: x86_64 > Operating System is 64 bit > ITK Version: 4.2.0 > Virtual Memory: Total: 256 Available: 226 > Physical Memory: Total:65536 Available: 57091 > Probe Name: Count Min Mean Stdev > Max Total > MeanSquares_1_threads 20 0.403931 0.40648 > 0.00213043 0.41389 8.1296 > MeanSquares_2_threads 20 0.243789 0.367603 > 0.0894637 0.65006 7.35206 > MeanSquares_4_threads 20 0.281336 0.354749 > 0.0431082 0.440161 7.09497 > MeanSquares_8_threads 20 0.24615 0.301576 > 0.0552998 0.446528 6.03151 > > > Brad > > > On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote: > >> Brad, >> >> The false sharing issue is a good point - however, i dont think this is the >> cause of the performance degradation. This part of the class (m_Threader, >> etc) has not changed since 3.20. (I used the optimized metrics in my 3.20 >> builds, so its in Review/itkOptMeanSquares....) It also does not explain the >> performance drop in single threaded mode. >> >> Testing will tell... Seems like a Friday afternoon project to me, unless >> someone else gets there first. >> >> Rupert >> >> -------------------------------------------------------------- >> Rupert Brooks >> [email protected] >> >> >> >> On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <[email protected]> >> wrote: >> Hello, >> >> Continuing to glance at the class.... I also see the following member >> variables for the MeanSquares class: >> >> MeasureType * m_ThreaderMSE; >> DerivativeType *m_ThreaderMSEDerivatives; >> >> Where these are index by the thread ID and access simultaneously across the >> threads causes the potential for False Sharing, which can be a MAJOR problem >> with threaded algorithms. >> >> I would think a good solution would be to create a per-thread data structure >> consisting of the Jacobin, MeasureType, and DerivativeType, plus padding to >> prevent false sharing, or equivalently assigning max data alignment to the >> structure. >> >> Rupert, Would like to take a stab at this fix? >> >> Brad >> >> >> On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote: >> >>> Sorry if this repeats - i just got a bounce from Insight Developers, so im >>> trimming the message and resending.... >>> -------------------------------------------------------------- >>> Rupert Brooks >>> [email protected] >>> >>> >>> >>> On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <[email protected]> >>> wrote: >>> Aha. Heres around line 183 of itkTranslationTransform. >>> >>> // Compute the Jacobian in one position >>> template <class TScalarType, unsigned int NDimensions> >>> void >>> TranslationTransform<TScalarType, >>> NDimensions>::ComputeJacobianWithRespectToParameters( >>> const InputPointType &, >>> JacobianType & jacobian) const >>> { >>> // the Jacobian is constant for this transform, and it has already been >>> // initialized in the constructor, so we just need to return it here. >>> jacobian = this->m_IdentityJacobian; >>> return; >>> } >>> >>> Thats probably the culprit, although the root cause may be the reallocating >>> of the jacobian every time through the loop. >>> >>> Rupert >>> >>> <snipped> >> >> > > ======================================================== > Bradley Lowekamp > Medical Science and Computing for > Office of High Performance Computing and Communications > National Library of Medicine > [email protected] > > > > ======================================================== Bradley Lowekamp Medical Science and Computing for Office of High Performance Computing and Communications National Library of Medicine [email protected]
_______________________________________________ Powered by www.kitware.com Visit other Kitware open-source projects at http://www.kitware.com/opensource/opensource.html Kitware offers ITK Training Courses, for more information visit: http://kitware.com/products/protraining.php Please keep messages on-topic and check the ITK FAQ at: http://www.itk.org/Wiki/ITK_FAQ Follow this link to subscribe/unsubscribe: http://www.itk.org/mailman/listinfo/insight-developers
