I modified the CUDA renderer to accept environment variable overrides 
so that a performance test can iterate through several combinations of 
register count, warp size, and optimization level.
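
As a rough illustration of this kind of override-driven sweep, here is a 
minimal Python sketch. The variable names (CUDA_TEST_MAXREG and so on) 
and both helper functions are hypothetical; the actual names used by the 
patch are not given in this message. The value lists are the ones that 
appear in the results table.

```python
import itertools
import os

# Hypothetical variable names; the real ones used by the patch are not stated.
ENV_REGISTERS = "CUDA_TEST_MAXREG"
ENV_WARP_SIZE = "CUDA_TEST_WARP_SIZE"
ENV_OPT_LEVEL = "CUDA_TEST_OPT_LEVEL"


def read_override(name, default, environ=os.environ):
    """Return the integer stored in `name`, or `default` if unset or invalid."""
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def sweep_environments(registers, warp_sizes, opt_levels, base_env=None):
    """Yield one environment dict per combination, ready to launch a render with."""
    base = dict(os.environ if base_env is None else base_env)
    for reg, warp, opt in itertools.product(registers, warp_sizes, opt_levels):
        env = dict(base)
        env[ENV_REGISTERS] = str(reg)
        env[ENV_WARP_SIZE] = str(warp)
        env[ENV_OPT_LEVEL] = str(opt)
        yield env


# Values taken from the results table: 5 x 3 x 5 combinations.
combos = list(sweep_environments([20, 24, 32, 42, 63],
                                 [64, 256, 1024],
                                 range(5),
                                 base_env={}))
```

Each dict could be passed as the `env` argument of `subprocess.run` when 
launching the command-line render, with the wall-clock time recorded per 
run.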

The code includes many performance optimizations that I have made: 
eliminating redundant memory allocations, making all operations 
asynchronous (all memory copies and kernel launches), using two streams 
fed from a single thread (interleaved 1:1 between them), each processing 
a separate tile, and using a RenderBuffers pool to eliminate the 
RenderBuffers allocations and frees.
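
The scheduling idea can be modelled on the host side without any CUDA 
calls. The sketch below is only a model, not the renderer's code: 
BufferPool, assign_streams and the buffer size are hypothetical stand-ins 
for the RenderBuffers pool and the two streams described above. It shows 
that with tiles interleaved 1:1 over two streams, a pool only ever needs 
to allocate as many buffers as there are tiles in flight.

```python
from collections import deque


class BufferPool:
    """Hand out reusable buffers so that steady-state rendering allocates nothing."""

    def __init__(self, make_buffer):
        self._make_buffer = make_buffer
        self._free = deque()
        self.allocations = 0  # number of real allocations performed

    def acquire(self):
        if self._free:
            return self._free.popleft()
        self.allocations += 1
        return self._make_buffer()

    def release(self, buf):
        self._free.append(buf)


def assign_streams(tiles, num_streams=2):
    """Interleave tiles across streams round-robin (1:1 for two streams)."""
    return [(tile, i % num_streams) for i, tile in enumerate(tiles)]


def simulate(tiles, num_streams=2, tile_bytes=64 * 64 * 16):
    """Return how many buffers get allocated while processing all tiles.

    tile_bytes is an arbitrary illustrative size, not the real buffer size.
    """
    pool = BufferPool(lambda: bytearray(tile_bytes))
    in_flight = deque()
    for _tile, _stream in assign_streams(tiles, num_streams):
        if len(in_flight) == num_streams:
            # The oldest tile is treated as finished; recycle its buffer.
            pool.release(in_flight.popleft())
        in_flight.append(pool.acquire())
    while in_flight:
        pool.release(in_flight.popleft())
    return pool.allocations


allocated = simulate(range(120))  # 120 tiles, two streams
```

In this model `allocated` is 2 regardless of the tile count, which is the 
effect the pool is meant to have.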

Surprisingly, the largest register count with the smallest warp size 
(that is, threads per block; the hardware warp itself is always 32 
threads) had the best time! The best time has only 33% occupancy.
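
The 33% figure can be reproduced with a simplified occupancy estimate. 
This sketch assumes the "warp size" column is the number of threads per 
block, uses the compute capability 2.0 per-multiprocessor limits (32768 
registers, 48 resident warps, 8 resident blocks), and ignores shared 
memory and register allocation granularity, so treat it as an 
approximation rather than a replacement for NVIDIA's occupancy 
calculator.

```python
# Per-multiprocessor limits for compute capability 2.0 (Fermi).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8
REGISTERS_PER_SM = 32768


def occupancy(registers_per_thread, threads_per_block):
    """Estimate occupancy as resident warps divided by the maximum warps per SM."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    regs_per_block = registers_per_thread * warps_per_block * WARP_SIZE
    resident_blocks = min(
        MAX_BLOCKS_PER_SM,                    # block-count limit
        MAX_WARPS_PER_SM // warps_per_block,  # warp-count limit
        REGISTERS_PER_SM // regs_per_block,   # register-file limit
    )
    return resident_blocks * warps_per_block / MAX_WARPS_PER_SM


best = occupancy(63, 64)     # fastest configuration in the table
slow = occupancy(20, 256)    # a much slower configuration
```

Under these assumptions `best` comes out at one third (8 blocks of 2 
warps each, out of 48 warps), while `slow` reaches full occupancy, so 
higher occupancy clearly did not translate into a better time here.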

Tested on SM 2.0 hardware (a single GTX 580), rendering from the command 
line. The display driver is nvidia-experimental-310. The scene is the 
BMW scene at the default size (1280x600) with the resolution percentage 
at 75% (giving 960x450) and the tile size at the default 64x64. Using 
the CUDA 5.0 toolkit on Linux Mint 14 MATE 64-bit. The Linux kernel is 
3.5.0-28-generic. The CPU is a 6-core Core i7-990X Extreme Edition with 
12MB L3. 20GB of 1066 RAM in triple-channel configuration. The bus 
interface is PCIe 16x.
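
For reference, the render size and tile count follow from simple 
arithmetic. The sketch below assumes a plain grid of 64x64 tiles anchored 
at one corner of the image; the renderer's actual tile layout may differ, 
and the function names are illustrative only.

```python
def scaled_resolution(width, height, percent):
    """Apply the resolution percentage to the base render size."""
    return width * percent // 100, height * percent // 100


def tile_grid(width, height, tile_size):
    """Number of tile columns and rows needed to cover the image."""
    cols = -(-width // tile_size)   # ceiling division
    rows = -(-height // tile_size)
    return cols, rows


width, height = scaled_resolution(1280, 600, 75)
cols, rows = tile_grid(width, height, 64)
tile_count = cols * rows
last_row_height = height - (rows - 1) * 64
```

With these inputs the image is 960x450 and the grid is 15 by 8, or 120 
tiles. Under the plain-grid assumption the last row of tiles is only 2 
pixels high, so not every tile carries the same amount of work.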

*registers*     *warp size*     *optimization*  *time (s)*
63      64      3       70.80
63      64      2       70.81
63      64      1       70.87
63      64      0       70.96
63      64      4       71.03
42      64      3       72.10
42      64      0       72.17
42      64      4       72.21
32      64      0       72.44
42      64      2       72.47
32      64      4       72.49
63      256     4       72.61
63      256     3       72.65
63      256     1       72.70
32      64      2       72.77
42      64      1       72.78
32      64      1       72.82
63      256     2       73.09
63      256     0       73.27
32      64      3       73.55
42      256     3       74.16
42      256     0       74.26
42      256     2       74.34
42      256     4       74.39
42      256     1       74.42
32      256     2       74.57
32      256     4       74.57
32      256     0       74.60
24      64      3       74.62
32      256     3       74.70
24      64      1       74.75
24      64      2       75.09
32      256     1       75.15
24      64      4       75.30
24      64      0       75.39
24      256     3       76.90
20      64      1       76.92
20      64      2       77.12
20      64      0       77.14
20      64      3       77.14
24      256     1       77.42
24      256     2       77.64
20      64      4       77.65
24      256     0       78.31
20      256     1       79.21
20      256     4       79.23
20      256     2       79.43
*20*    *256*   *3*     *79.75*
24      256     4       79.81
20      256     0       79.83
32      1024    2       100.93
32      1024    3       100.95
32      1024    0       101.25
32      1024    4       101.28
32      1024    1       101.47
24      1024    2       105.52
24      1024    0       105.62
24      1024    4       106.00
24      1024    3       106.39
24      1024    1       106.53
20      1024    1       111.51
20      1024    2       111.61
20      1024    3       111.69
20      1024    0       111.74
20      1024    4       111.84

The following combinations fail due to insufficient resources:

/42/    /1024/  /2/     /failed/
/63/    /1024/  /4/     /failed/
/42/    /1024/  /4/     /failed/
/63/    /1024/  /1/     /failed/
/63/    /1024/  /2/     /failed/
/42/    /1024/  /0/     /failed/
/42/    /1024/  /1/     /failed/
/63/    /1024/  /0/     /failed/
/42/    /1024/  /3/     /failed/
/63/    /1024/  /3/     /failed/
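
One plausible explanation for these failures is the register file: a 
block can only launch if all of its threads' registers fit on one 
multiprocessor. The check below is a sketch under the assumptions that 
the "warp size" column is threads per block and that the limit is the 
32768 registers per multiprocessor of compute capability 2.0; the run 
only reported insufficient resources, so this is an inference, not a 
diagnosis.

```python
REGISTERS_PER_SM = 32768  # compute capability 2.0


def block_fits(registers_per_thread, threads_per_block):
    """True if one block's registers fit in a multiprocessor's register file."""
    return registers_per_thread * threads_per_block <= REGISTERS_PER_SM


register_counts = [20, 24, 32, 42, 63]
block_sizes = [64, 256, 1024]

# Collect every (registers, threads per block) pair that cannot launch.
too_big = [(regs, threads)
           for regs in register_counts
           for threads in block_sizes
           if not block_fits(regs, threads)]
```

Here `too_big` contains only (42, 1024) and (63, 1024), matching the 
failed rows, while 32 registers at 1024 threads fits exactly (32 x 1024 = 
32768).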

I am going to re-run the analysis with even smaller warp sizes to see 
what happens.


Bf-committers mailing list
