[osg-users] osgPPU CUDA Example - slower than expected?
Hi,

as I explained in another mail to this list, I am currently working on a graph-based image processing framework using CUDA. Basically, this is independent of OSG, but I am using OSG for my example application :-)

For my first implemented postprocessing algorithm I need color and depth data. As I want the depth to be linearized between 0 and 1, I use a shader for that, and I render it in a separate pass from the color. The results are then fetched from the GPU to the CPU by directly attaching osg::Images to the cameras. This works perfectly, but is rather slow, as you might already have suspected, because the data is processed in CUDA kernels afterwards, which means quite a bit of back and forth ;-)

In fact, my application with three CUDA-based filter kernels (one Gaussian blur with radius 21, one image subtract and one image "pseudo-add" (about as elaborate as a simple add ;-)) yields about 15 fps at a resolution of 1024 x 1024 (images for normal and absolute position information are also rendered and transferred from GPU to CPU here).

So with these 15 fps, I thought it should perform FAR better when avoiding the GPU <-> CPU copying. That's when I came across the osgPPU-cuda example. As far as I am aware, it uses direct mapping of PixelBufferObjects to CUDA memory space. This should be fast! At least that's what I thought, but running it at a resolution of 1024 x 1024 with a StatsHandler attached shows that it runs at just ~21 fps, and it does not get much better when the CUDA kernel execution is completely disabled.

Now my question is: Is this a general (known) problem which cannot be avoided? Does it have anything to do with the memory mapping functions? How can it be optimized? I know that, while osgPPU uses older CUDA memory mapping functions, there are new ones as of CUDA 3. Is there a difference in performance?

Any information on this is appreciated, because it will really help me to decide whether I should integrate buffer mapping or just keep the copying going :-)

Best Regards
-Thorsten

___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org
Re: [osg-users] osgPPU CUDA Example - slower than expected?
By the way: There are two CUDA-capable devices in the computer, but I have tried using the rendering device as well as the "CUDA-only" device -> no difference!

-Thorsten

On 16.12.2010 12:25, Thorsten Roth wrote:
> [...]
Re: [osg-users] osgPPU CUDA Example - slower than expected?
OK, I correct this: There is a difference of ~1 frame ;) ...now I will stop replying to my own messages :D

On 16.12.2010 12:31, Thorsten Roth wrote:
> [...]
Re: [osg-users] osgPPU CUDA Example - slower than expected?
Hi,

I don't have any suggestions other than to use a GL debugger to make sure nothing is going to the CPU, or to try the new CUDA functions in osgPPU or in your own code. I remember something in the GL-to-CUDA path bugging me, but cannot remember the details. AFAIR something was converting from texture to PBO and then to CUDA memory.

jp

On 16/12/10 13:25, Thorsten Roth wrote:
> [...]

-- 
This message is subject to the CSIR's copyright terms and conditions, e-mail legal notice, and implemented Open Document Format (ODF) standard. The full disclaimer details can be found at http://www.csir.co.za/disclaimer.html. This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. MailScanner thanks Transtec Computers for their support.
Re: [osg-users] osgPPU CUDA Example - slower than expected?
Hi Thorsten,

the problem you are experiencing is caused by the lack of direct memory mapping between OpenGL and CUDA memory. Even though CUDA supports GPU<->GPU memory mapping (at least it did in version 2), a full memory copy is performed whenever you access OpenGL textures. I am not aware whether this was solved in CUDA 3; maybe you should check it out. CUDA 2 definitely doesn't perform direct mapping between GL textures and CUDA textures/arrays.

regards, art

Thorsten Roth wrote:
> [...]

Read this topic online here: http://forum.openscenegraph.org/viewtopic.php?p=35415#35415
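For anyone who wants to check out the newer interface mentioned in this thread: CUDA 3 introduced the cudaGraphics* interop functions and deprecated the older cudaGL* buffer-object calls. The following is only a sketch of how a PBO might be registered and mapped with the newer calls. It assumes a current OpenGL context, an already created PBO and a CUDA-capable device; `registerPbo` and `processFrame` are hypothetical names, and nothing here is taken from osgPPU. Whether this path avoids the copy described above is not answered in the thread.

```cpp
#include <cstddef>
#include <GL/gl.h>            // GLuint (on Windows, include <windows.h> first)
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Register the PBO with CUDA once, after it has been created in the current GL context.
cudaGraphicsResource* registerPbo(GLuint pbo)
{
    cudaGraphicsResource* resource = nullptr;
    const cudaError_t err =
        cudaGraphicsGLRegisterBuffer(&resource, pbo, cudaGraphicsRegisterFlagsNone);
    return err == cudaSuccess ? resource : nullptr;
}

// Per frame: map the buffer, obtain a device pointer, run kernels, unmap.
bool processFrame(cudaGraphicsResource* resource, cudaStream_t stream)
{
    if (cudaGraphicsMapResources(1, &resource, stream) != cudaSuccess)
        return false;

    void* devicePtr = nullptr;
    std::size_t numBytes = 0;
    const cudaError_t err =
        cudaGraphicsResourceGetMappedPointer(&devicePtr, &numBytes, resource);

    if (err == cudaSuccess) {
        // Kernels working on devicePtr (numBytes bytes) would be launched here.
    }

    // GL must not touch the buffer while it is mapped, so always unmap again.
    cudaGraphicsUnmapResources(1, &resource, stream);
    return err == cudaSuccess;
}
```

When the buffer is no longer needed, it should be released with cudaGraphicsUnregisterResource(). Registering once and only mapping and unmapping per frame keeps the per-frame work small; this sketch covers buffer objects only, while textures go through cudaGraphicsGLRegisterImage() instead.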
Re: [osg-users] osgPPU CUDA Example - slower than expected?
On 01/07/2011 03:34 PM, Art Tevs wrote:
> [...]

I know that OpenCL 1.1 added a bunch of OpenGL interoperability features (clCreateFromGLBuffer(), clCreateFromGLTexture2D(), etc.), and I thought I heard that the newer versions of CUDA supported similar features. OpenGL 4.1 added some CL interop features, too.

--"J"
Re: [osg-users] osgPPU CUDA Example - slower than expected?
Thanks for the answers. I also know that there are new interoperability features in CUDA 3, but I haven't had the time to check them out yet. If I find the time, I will let you know about the results :)

Regards
-Thorsten

On 07.01.2011 22:23, Jason Daly wrote:
> [...]
Re: [osg-users] osgPPU CUDA Example - slower than expected?
Dear Mr. Art,

I am also noticing a problem similar to the one Mr. Thorsten pointed out. Curious about how well OpenGL and CUDA work together, I downloaded OSG 2.9.10 and the osgCompute nodekit. I have CUDA 3.2 installed on my machine, a Core 2 Duo with a GeForce. The osgGeometryDemo sample code for warping with cow.osg gives a reasonably high frame rate. I thought I should share this in case it is of any help.

Regards
Harash

From: J.P. Delport
To: OpenSceneGraph Users
Sent: Mon, January 3, 2011 3:30:34 PM
Subject: Re: [osg-users] osgPPU CUDA Example - slower than expected?
> [...]