Hi Simon, thanks for the clarification. Just to understand what’s going on here: The issue is that CudaFDKConeBeamReconstructionFilter inherits properties from its superclass FDKConeBeamReconstructionFilter and not explicitly setting the value SLAB_SIZE? In GPUGenerateData()<https://github.com/SimonRit/RTK/blob/601715281db1ebb7a8a7e436bb0608814ee30e91/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L39> the CPUSuperclass:GenerateData()<https://github.com/SimonRit/RTK/blob/601715281db1ebb7a8a7e436bb0608814ee30e91/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L41> method is called that uses m_ProjectionSubsetSize<https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L135> to group projections, which is now defaulted to 2 as per this part<https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L49-L56>. So a simple CPUSuperclass::SetProjectionSubsetSize(SLAB_SIZE); In the constructor of CudaFDKConeBeamReconstructionFilter should be enough? Or am I approaching the issue from the wrong side and mixing up things?
Best, Moritz Von: Simon Rit <simon....@creatis.insa-lyon.fr> Gesendet: Freitag, 19. November 2021 07:25 An: Moritz Schaar <sch...@imt.uni-luebeck.de> Cc: rtk-users@public.kitware.com Betreff: Re: [Rtk-users] Slow CUDA FDK performance Hi, Thanks for isolating the problem. It's interesting to note that the comment<https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48> disagree with the content<https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L49-L56>... This commit<https://github.com/SimonRit/RTK/commit/42bd0e2557a04e1bea5e6a25f63d251a12743fc3> is not sufficiently clear to understand why it's been changed at the time (and I don't know if it's still necessary). This could probably be changed in the Cuda constructor<https://github.com/SimonRit/RTK/blob/master/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L21>. Note that the computation of the ramp kernel always uses the CPU filter https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFFTRampImageFilter.hxx#L84 but this should not be a problem as only one line is computed anyway (before being replicated). Feel free to propose a pull request with this change if it does solve your issue. As far as I can see, RTK_CUDA_PROJECTIONS_SLAB_SIZE (which becomes SLAB_SIZE) is only used in the Cuda code of the forward and backprojections, not for setting the FDK processing, so you have to set it manually. Simon On Thu, Nov 18, 2021 at 3:10 PM Moritz Schaar <sch...@imt.uni-luebeck.de<mailto:sch...@imt.uni-luebeck.de>> wrote: Hi Simon, thank you for looking into it! So there is no general issue, which is nice to know. Sadly I cannot run the old version as I end up with the following error: “ImportError: DLL load failed while importing _RTKPython” ITK works, RTK doesn’t. And rebuilding also doesn’t work as too many things changed (CUDA, VS, ..). In the meantime I figured out that modifying “m_ProjectionSubsetSize” helps to accelerate everything. Checking this with “feldkamp.GetProjectionSubsetSize()” I get for the CPU and CUDA version of the FDKConeBeamReconstructionFilter a value of “2”. Regarding your test code I, obviously, get very slow execution times with this value. Increasing the number of subsets to 200 I end up with similar values you reported. Now I am wondering where these defaults come from. For CPU I assume that this comes from missing FFTW as given here https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48 However, I do not understand why the CUDA version defaults to 2 instead of 16. In CMake I kept the default: RTK_CUDA_PROJECTIONS_SLAB_SIZE=16 From https://github.com/SimonRit/RTK/blob/master/rtkConfiguration.h.in#L35 I assume that this gets copied to SLAB_SIZE which will be used by all CUDA codes. My rtkConfiguration.h also reflects this: #ifndef SLAB_SIZE # define SLAB_SIZE 16 #endif Do you have an explanation why this gets reduced from 16 to 2? Or a hint where I can have a look? Best, Moritz Von: Simon Rit <simon....@creatis.insa-lyon.fr<mailto:simon....@creatis.insa-lyon.fr>> Gesendet: Donnerstag, 18. November 2021 12:13 An: Moritz Schaar <sch...@imt.uni-luebeck.de<mailto:sch...@imt.uni-luebeck.de>> Cc: rtk-users@public.kitware.com<mailto:rtk-users@public.kitware.com> Betreff: Re: [Rtk-users] Slow CUDA FDK performance Hi, I compiled the python packages with exactly the same configurations and I can't reproduce the issue old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0 -> 0.9 s 0.019904613494873047 0.6475656032562256 Reconstructing... 0.9730124473571777 new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0 0.017342329025268555 0.7650339603424072 Reconstructing... 0.8823671340942383 The code I ran is the following #!/usr/bin/env python import sys import itk import time from itk import RTK as rtk if len ( sys.argv ) < 3: print( "Usage: FirstReconstruction <outputimage> <outputgeometry>" ) sys.exit ( 1 ) # Defines the image type GPUImageType = rtk.CudaImage[itk.F,3] CPUImageType = rtk.Image[itk.F,3] # Defines the RTK geometry object geometry = rtk.ThreeDCircularProjectionGeometry.New() numberOfProjections = 200 firstAngle = 0. angularArc = 360. sid = 600 # source to isocenter distance sdd = 1200 # source to detector distance for x in range(0,numberOfProjections): angle = firstAngle + x * angularArc / numberOfProjections geometry.AddProjection(sid,sdd,angle) # Writing the geometry to disk xmlWriter = rtk.ThreeDCircularProjectionGeometryXMLFileWriter.New() xmlWriter.SetFilename ( sys.argv[2] ) xmlWriter.SetObject ( geometry ); xmlWriter.WriteFile(); # Create a stack of empty projection images ConstantImageSourceType = rtk.ConstantImageSource[GPUImageType] constantImageSource = ConstantImageSourceType.New() origin = [ -127.75, -127.75, 0. ] sizeOutput = [ 512, 512, numberOfProjections ] spacing = [ 0.5, 0.5, 0.5 ] constantImageSource.SetOrigin( origin ) constantImageSource.SetSpacing( spacing ) constantImageSource.SetSize( sizeOutput ) constantImageSource.SetConstant(0.) REIType = rtk.RayEllipsoidIntersectionImageFilter[CPUImageType, CPUImageType] rei = REIType.New() semiprincipalaxis = [ 50, 50, 50] center = [ 0, 0, 10] # Set GrayScale value, axes, center... rei.SetDensity(2) rei.SetAngle(0) rei.SetCenter(center) rei.SetAxis(semiprincipalaxis) rei.SetGeometry( geometry ) rei.SetInput(constantImageSource.GetOutput()) # Create reconstructed image constantImageSource2 = ConstantImageSourceType.New() sizeOutput = [ 256 ] * 3 origin = [ -63.75 ] * 3 spacing = [ 0.5 ] * 3 constantImageSource2.SetOrigin( origin ) constantImageSource2.SetSpacing( spacing ) constantImageSource2.SetSize( sizeOutput ) constantImageSource2.SetConstant(0.) t0 = time.time() constantImageSource2.Update() t1 = time.time() print(t1-t0) # Graft the projections to an itk::CudaImage projections = GPUImageType.New() t0 = time.time() rei.Update() t1 = time.time() print(t1-t0) projections.SetPixelContainer(rei.GetOutput().GetPixelContainer()) projections.CopyInformation(rei.GetOutput()) projections.SetBufferedRegion(rei.GetOutput().GetBufferedRegion()) projections.SetRequestedRegion(rei.GetOutput().GetRequestedRegion()) # FDK reconstruction print("Reconstructing...") FDKGPUType = rtk.CudaFDKConeBeamReconstructionFilter feldkamp = FDKGPUType.New() feldkamp.SetInput(0, constantImageSource2.GetOutput()) feldkamp.SetInput(1, projections) feldkamp.SetGeometry(geometry) feldkamp.GetRampFilter().SetTruncationCorrection(0.0) feldkamp.GetRampFilter().SetHannCutFrequency(0.0) t0 = time.time() feldkamp.Update() t1 = time.time() print(t1-t0) To be honest I don't see to do at this stage... Can you maybe check the same code with your two versions ? Any other suggestion? Simon On Wed, Nov 10, 2021 at 10:03 AM Moritz Schaar <sch...@imt.uni-luebeck.de<mailto:sch...@imt.uni-luebeck.de>> wrote: Hi Simon, I completely agree that this is hard to track down. That’s why I am asking for directions ☺ To be more precise about the execution times of my example: The timings given in pairs of 17.1/1.2 s and 19/7 s are only the required times of the reconstruction step itself. Reading data, pre and post processing are not part of this time measurement. So the 7 s average in python is similar to the 6.41 s I obtained from adding everything done in CudaFDKConeBeamReconstructionFilter using RTK_PROBE_EACH_FILTER. The reconstruction step in python simply involves: - Instantiation of a simple class, this doesn’t add anything to the timings - Setting up ConstantImageSource with either rtk.Image or rtk.CudaImage - Setting up FDKConeBeamReconstructionFilter/CudaFDKConeBeamReconstructionFilter - Setting inputs, geometry and filter - Update() and return result Looks like there was a typo in my mail, the versions compared should be: old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0 new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0 Sorry for the confusion and thanks for looking into it! Best, Moritz Von: Simon Rit <simon....@creatis.insa-lyon.fr<mailto:simon....@creatis.insa-lyon.fr>> Gesendet: Mittwoch, 10. November 2021 09:32 An: Moritz Schaar <sch...@imt.uni-luebeck.de<mailto:sch...@imt.uni-luebeck.de>> Cc: rtk-users@public.kitware.com<mailto:rtk-users@public.kitware.com> Betreff: Re: [Rtk-users] Slow CUDA FDK performance Hi Moritz, Thanks for the report. It's a bit hard to be convinced that something is wrong without being able to reproduce it. From the RTK_PROBE_EACH_FILTER log, most of the time is spent reading the projections which will be the same with or without cuda so I wonder if this is not the issue here. I can try to reproduce the issue, can you just confirm the two configurations : Cuda 10.2, ITK 5.2.1, RTK 2.1.0 vs Cuda 11.5, ITK 5.2.1 RTK 2.3.0 ? Thanks, Simon On Fri, Nov 5, 2021 at 4:20 PM Moritz Schaar <sch...@imt.uni-luebeck.de<mailto:sch...@imt.uni-luebeck.de>> wrote: Hi, I recently upgraded my Windows 10 system to ITK 5.2.1 including RTK 2.3.0. This also involved upgrading CUDA from 10.2 to 11.5, Visual Studio 2019 and even python update (3.8.5 to 3.8.12). Using the python wrapping of RTK I implemented own routines that use FDK similar to the rtkfdk application. On the old system (ITK 5.2.1, RTK 2.1.0) I benchmarked the FDK for a 512x512x200 dataset reconstructed into 256x256x256 with 1.0 mm isotropic voxel size. The system is equipped with 24 CPU cores and one RTX 2080 Ti, so the CPU version took 17.1 and the CUDA version 1.2 seconds. Running the new software version on the same system results in roughly 19 s CPU time but more than 7 s for the CUDA version. I don’t care about the actual timings but the relative increase of the CUDA version is what bothers me. To dig up some more information I recompiled RTK with RTK_PROBE_EACH_FILTER and ran rtkfdk.exe for the same data, this is what I got: ************************************************************************************************************** Probe Tag Starts Stops Time (s) Memory (kB) Cuda memory (kB) ************************************************************************************************************** ChangeInformationImageFilter 200 200 0.0211846 0 0 ConstantImageSource 1 1 0.0305991 65668 0 CudaCropImageFilter 13 13 0.0222911 15786.8 15753.8 CudaDisplacedDetectorImageFilter 13 13 0.0540568 10719.1 16384 CudaFDKBackProjectionImageFilter 13 13 0.0326397 5051.38 5041.23 CudaFDKConeBeamReconstructionFilter 1 1 5.72999 552184 211648 CudaFDKWeightProjectionFilter 13 13 0.0262806 -13892 630.154 CudaFFTRampImageFilter 13 13 0.148416 43095.4 12499.7 CudaParkerShortScanImageFilter 13 13 0.0467202 2525.85 15753.8 ExtractImageFilter 13 13 0.0259726 15812.3 -15753.8 ImageFileReader 200 200 0.0226735 -0.16 0 ImageSeriesReader 200 200 0.066097 6.12 0 ProjectionsReader 1 1 26.0388 208488 0 StreamingImageFilter 2 2 16.0663 547512 191840 VnlRealToHalfHermitianForwardFFTImageFilter 2 2 0.0208174 0 0 Following the conversion on the mailing list, https://public.kitware.com/pipermail/rtk-users/2018-July/010617.html, I see that the CudaFDKConeBeamReconstructionFilter takes 6.41 s of which roughly 1/3 is spent in the CudaFFTRampImageFilter. Sadly I don’t have these results for the old software version so I can’t relate these values. However, I also played around with v2.2.0 but it doesn’t make a difference. Sadly, the version I used before (v2.1.0) won’t compile with CUDA 11.5 anymore. I tried to add small adjustments e.g. this commit https://github.com/SimonRit/RTK/commit/3d3c7506087f5fa98aee75df5af5c30e7e51cbe6 to make things work but this didn’t work. The same happens with other errors when trying to setup ITK 5.1.2, so getting back the old version for comparison seems impossible. Is there any direction you can point me to check what is actually the issue here? Or maybe someone has an idea what could be the reason? CUDA/RTK/ITK version? Any help is appreciated. Best, Moritz _______________________________________________ Rtk-users mailing list Rtk-users@public.kitware.com<mailto:Rtk-users@public.kitware.com> https://public.kitware.com/mailman/listinfo/rtk-users
_______________________________________________ Rtk-users mailing list Rtk-users@public.kitware.com https://public.kitware.com/mailman/listinfo/rtk-users