[Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Jürgen Herrmann
Hi there,

I ran some tests with CUDA 5.0 and 5.5 today and changed the nvcc
optimization flags for cycles_kernel_cuda.

Here is what I found:

-  “--opencc-options” is deprecated for sm_20 and up and should be
removed from the compiler options

-  Passing “-O3” and “--use_fast_math” as nvcc options brings a massive
speedup on my system (more below)

-  We shouldn’t complain about new CUDA toolkits being slow; we
should find a solution, as we can’t use old software forever…

On to the speedups:

Example 1:

System: i7-3820 @ 3.60GHz, GeForce GTX 660

Blender (cycles_kernel_cuda) compiled with standard settings:

Mike_pan file took 02:06.60 to render

Blender (cycles_kernel_cuda) compiled with -O3 --use_fast_math:

Mike_pan file took 01:39.93 to render

 

There is no visible difference in the render results:

Image1: http://www.pasteall.org/pic/52757

Image2: http://www.pasteall.org/pic/52758

I bet there’s more potential in there.

/Jürgen

___
Bf-committers mailing list
Bf-committers@blender.org
http://lists.blender.org/mailman/listinfo/bf-committers


[Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Sergey Kurdakov
Hi all,

>As long as MinGW OpenMP isn't fixed

It is fixed in TDM-GCC: http://tdm-gcc.tdragon.net/download (see the OpenMP
download), and a patch for the gomp library that can be applied to any other
MinGW version can be found here:

http://netcologne.dl.sourceforge.net/project/tdm-gcc/Sources/TDM%20Sources/gcc-4.7.1-tdmsrc-1.zip

Not sure how useful it is, but still, there is a patch which can be
applied.

Regards
Sergey


Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Dalai Felinto
Hi Jurgen,

How do these times compare between CUDA 5.0 and 5.5?
(Is this a speedup for 5.5 that is still a slowdown relative to 5.0, or
is it an overall speedup?)

--
Dalai



Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Jürgen Herrmann
Hi Dalai,

I tested 5.5 on a different system. I don't have access to that machine right
now; I'll deliver the complete benchmark results tomorrow.
I plan to compare as many different configurations as possible, with 32 and
64 bit and different CUDA versions.

This will take some time but I think it's worth it. ;)

/Jürgen



Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Brecht Van Lommel
Thanks for testing. I've also been doing some experimenting with
compile flags and other things here. So far it seems I can make my
650M render a few percent faster compared to CUDA 4.2, but 460 GT
is still considerably slower with the BMW scene (2m30s with 5.5
compared to 2m01s with 4.2), and 580 GTX had a similar difference. It
seems you are testing with a 6xx card so that makes sense.

Patch attached for those who want to test this with 5.0/5.5.

diff --git a/intern/cycles/device/device_cuda.cpp b/intern/cycles/device/device_cuda.cpp
index f32c6dd..27978b9 100644
--- a/intern/cycles/device/device_cuda.cpp
+++ b/intern/cycles/device/device_cuda.cpp
@@ -46,6 +46,7 @@ public:
 	map<device_ptr, bool> tex_interp_map;
 	int cuDevId;
 	bool first_error;
+	vector<CUstream> cuStreams;
 
 	struct PixelMem {
 		GLuint cuPBO;
@@ -205,6 +206,12 @@ public:
 		if(cuda_error_(result, "cuCtxCreate"))
 			return;
 
+		const int num_streams = 8;
+		cuStreams.resize(num_streams);
+
+		for(int i = 0; i < num_streams; i++)
+			cuStreamCreate(&cuStreams[i], 0);
+
 		cuda_pop_context();
 	}
 
@@ -212,6 +219,9 @@ public:
 	{
 		task_pool.stop();
 
+		for(int i = 0; i < cuStreams.size(); i++)
+			cuStreamDestroy(cuStreams[i]);
+
 		cuda_push_context();
 		cuda_assert(cuCtxDetach(cuContext))
 	}
@@ -514,7 +524,7 @@ public:
 		}
 	}
 
-	void path_trace(RenderTile& rtile, int sample)
+	void path_trace(RenderTile& rtile, int sample, CUstream stream)
 	{
 		if(have_error())
 			return;
@@ -575,9 +585,9 @@ public:
 
 		cuda_assert(cuFuncSetCacheConfig(cuPathTrace, CU_FUNC_CACHE_PREFER_L1))
 		cuda_assert(cuFuncSetBlockShape(cuPathTrace, xthreads, ythreads, 1))
-		cuda_assert(cuLaunchGrid(cuPathTrace, xblocks, yblocks))
+		cuda_assert(cuLaunchGridAsync(cuPathTrace, xblocks, yblocks, stream))
 
-		cuda_assert(cuCtxSynchronize())
+		//cuda_assert(cuCtxSynchronize())
 
 		cuda_pop_context();
 	}
@@ -882,12 +892,35 @@ public:
 	void thread_run(DeviceTask *task)
 	{
 		if(task->type == DeviceTask::PATH_TRACE) {
-			RenderTile tile;
+			vector<RenderTile> concurrent_tiles(cuStreams.size());
+			vector<bool> have_tile(cuStreams.size());
 
 			/* keep rendering tiles until done */
-			while(task->acquire_tile(this, tile)) {
-				int start_sample = tile.start_sample;
-				int end_sample = tile.start_sample + tile.num_samples;
+			while(1) {
+				int start_sample = -1;
+				int end_sample = -1;
+
+				for(int i = 0; i < concurrent_tiles.size(); i++) {
+					RenderTile& tile = concurrent_tiles[i];
+
+					if(task->acquire_tile(this, tile)) {
+						have_tile[i] = true;
+
+						if(start_sample == -1) {
+							start_sample = tile.start_sample;
+							end_sample = tile.start_sample + tile.num_samples;
+						}
+

Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Jürgen Herrmann
Hi Brecht,

You're welcome ;)
I think I'll have to include these compiler flag optimizations in the VS2012
builds, otherwise those builds will be significantly slower than builds made
with other compilers :(
Cycles has two problems when it comes to Windows/VS2012 builds:
1. It is much slower on the CPU.
2. It is slower with CUDA if we don't use the optimization flags.

As long as MinGW OpenMP isn't fixed, we have to stick to VS2008/CUDA 4.2 or
find a decent solution for VS2012/CUDA 5.5.

As Blender seems to be used mostly by Windows users, this isn't optimal.

/Jürgen



Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Thomas Dinges
Hi Jürgen,
On 03.06.2013 20:46, Jürgen Herrmann wrote:
> -  “--opencc-options” is deprecated for sm_20 and up and should be
> removed from compiler options
This option should not be harmful; we just kept it in for the sm_1x
architecture. I am not sure whether sm_1x still builds at all, though. We
dropped official support for it with Blender 2.67, so it can probably be
removed.
> -  Stating “-O3” and “--use_fast_math” as nvcc options brings massive
> speedup on my system (more below)
We cannot use fast math everywhere. I remember that Brecht changed the
code to use fast math functions only selectively, to avoid some
precision problems:
http://projects.blender.org/scm/viewvc.php?view=rev&root=bf-blender&revision=47133
But that may no longer be valid with Toolkit 5.x; it needs further tests.
>
> -  We shouldn’t complain about new cuda toolsets that are slow, we
> should find a solution as we can’t use old software forever…
>
True, but it's still sad that an upgrade causes all those problems. :)

Thanks for testing!

Thomas

-- 
Thomas Dinges
Blender Developer, Artist and Musician

www.dingto.org



Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Thomas Dinges
Hi Brecht,
this looks very promising. On my GeForce 540M (Windows x64, driver
320.20) I get these render times with the BMW scene (100 samples,
128x128 tiles).

Vanilla Trunk (Toolkit 4.2): 2.29 minutes
Vanilla Trunk (Toolkit 5.5 RC): 3.54 minutes

Trunk + patch (Toolkit 5.5 RC): 3.00 minutes

So, that is definitely much better. :)

Best regards,
Thomas


-- 
Thomas Dinges
Blender Developer, Artist and Musician

www.dingto.org



Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Yousef Hurfoush
hi

> it is fixed in http://tdm-gcc.tdragon.net/download ( see openmp download )
> and patch to any other MinGW version for gomp library can be found here
> 
> http://netcologne.dl.sourceforge.net/project/tdm-gcc/Sources/TDM%20Sources/gcc-4.7.1-tdmsrc-1.zip
> 

Could Antony maybe check this and enable OpenMP correctly? If it solves the
mingw64 problem, this could be the answer, guys!




Regards
Yousef Harfoush
ba...@msn.com

  


Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Yousef Hurfoush
Hi again,

I had some time, so I ran some tests with a simple smoke scene:

gcc 4.7.1 mingw64, compiled without OpenMP: took 1:46 s
gcc 4.7.1 mingw64, compiled with OpenMP: crashes
gcc 4.7.1 mingw64, compiled with OpenMP and with the DLLs replaced by the
fixed ones from the site: took 0:25 s

It seems the problem has been fixed :)

I hope someone does a full Blender regression test.
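
For readers unfamiliar with what an OpenMP build changes: loops are marked with OpenMP pragmas, and GCC's runtime library (libgomp, the library the fix mentioned in this thread applies to) runs them on several threads. The following is a generic C++ sketch of such a loop, not code taken from Blender, and sum_of_squares is a made-up name. It assumes a build with -fopenmp. Without that flag the pragma is ignored and the loop runs serially.

```cpp
#include <vector>

// Generic illustration (not Blender code): a loop parallelized with OpenMP.
// The reduction clause gives each thread a private partial sum and adds the
// partial sums together when the loop finishes.
double sum_of_squares(const std::vector<double> &values)
{
	double total = 0.0;
	const long n = static_cast<long>(values.size());

#pragma omp parallel for reduction(+ : total)
	for (long i = 0; i < n; i++) {
		total += values[i] * values[i];
	}

	return total;
}
```

The result is the same with or without OpenMP for exactly representable inputs. Only the run time differs, so comparing builds as above is a reasonable way to check whether the runtime library works.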

Regards
Yousef Harfoush
ba...@msn.com


  


Re: [Bf-committers] massive cuda speed improvements with Cuda 5.0/5.5

2013-06-03 Thread Jürgen Herrmann
Sounds promising!
If that really works, we could just drop the whole MSVC stuff and build with
MinGW ;)

