I modified the CUDA renderer to accept environment variable overrides, so that a performance test can iterate through several combinations of parameters.
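A sweep of this kind could be driven by a small script along the following lines. This is only a sketch under stated assumptions: the environment variable names (`CYCLES_CUDA_MAXREG`, `CYCLES_CUDA_WARP_SIZE`, `CYCLES_CUDA_OPT`) are placeholders invented for illustration, not the names used in the modified renderer, and the render command is supplied by the caller. The parameter values are the ones that appear in the results.

```python
import itertools
import os
import subprocess
import time

# Hypothetical override names; the real ones are not given in this message.
ENV_REGISTERS = "CYCLES_CUDA_MAXREG"
ENV_WARP_SIZE = "CYCLES_CUDA_WARP_SIZE"
ENV_OPTIMIZATION = "CYCLES_CUDA_OPT"

# Parameter values taken from the results table.
REGISTERS = (20, 24, 32, 42, 63)
WARP_SIZES = (64, 256, 1024)
OPTIMIZATIONS = (0, 1, 2, 3, 4)


def combinations():
    """Return every (registers, warp size, optimization) triple."""
    return list(itertools.product(REGISTERS, WARP_SIZES, OPTIMIZATIONS))


def build_env(registers, warp_size, optimization, base=None):
    """Copy the base environment and add the three overrides as strings."""
    env = dict(os.environ if base is None else base)
    env[ENV_REGISTERS] = str(registers)
    env[ENV_WARP_SIZE] = str(warp_size)
    env[ENV_OPTIMIZATION] = str(optimization)
    return env


def run_once(command, env):
    """Run one render; return elapsed seconds, or None if it failed."""
    start = time.perf_counter()
    result = subprocess.run(command, env=env)
    elapsed = time.perf_counter() - start
    return elapsed if result.returncode == 0 else None


def sweep(command, runner=run_once):
    """Run every combination; return rows sorted fastest first, failures last."""
    rows = []
    for registers, warp_size, optimization in combinations():
        env = build_env(registers, warp_size, optimization)
        rows.append((registers, warp_size, optimization, runner(command, env)))
    rows.sort(key=lambda r: (r[3] is None, r[3] if r[3] is not None else 0.0))
    return rows
```

A call such as `sweep(["blender", "-b", "scene.blend", "-f", "1"])` would render one frame per combination from the command line (the file name here is a placeholder). The `runner` argument exists so the loop can be exercised without a GPU.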
The code includes many performance optimizations that I have made: eliminating redundant memory allocations, making all operations asynchronous (all memory copies and kernel launches), using two streams fed from a single thread (interleaved 1:1 between them), each processing a separate tile, and using a RenderBuffers pool to eliminate those allocations and frees.

Surprisingly, the largest register count with the smallest warp size (threads per block) had the best time! The best time has 33% occupancy.

Tested on SM 2.0 hardware (single GTX 580), rendered from the command line. Display driver is nvidia-experimental-310. Scene is the BMW scene with the default size (1280x600), percent at 75% (result is 960x450), tile size at the default 64x64. Using the CUDA 5.0 toolkit on Linux Mint 14 Mate 64-bit. Linux kernel is 3.5.0-28-generic. CPU is a 6-core Core i7 990X Extreme Edition with 12MB L3. Bus interface is 16x PCIe. 20GB 1066 RAM in triple-channel configuration.

registers  warp size  optimization  time (s)
63         64         3             70.80
63         64         2             70.81
63         64         1             70.87
63         64         0             70.96
63         64         4             71.03
42         64         3             72.10
42         64         0             72.17
42         64         4             72.21
32         64         0             72.44
42         64         2             72.47
32         64         4             72.49
63         256        4             72.61
63         256        3             72.65
63         256        1             72.70
32         64         2             72.77
42         64         1             72.78
32         64         1             72.82
63         256        2             73.09
63         256        0             73.27
32         64         3             73.55
42         256        3             74.16
42         256        0             74.26
42         256        2             74.34
42         256        4             74.39
42         256        1             74.42
32         256        2             74.57
32         256        4             74.57
32         256        0             74.60
24         64         3             74.62
32         256        3             74.70
24         64         1             74.75
24         64         2             75.09
32         256        1             75.15
24         64         4             75.30
24         64         0             75.39
24         256        3             76.90
20         64         1             76.92
20         64         2             77.12
20         64         0             77.14
20         64         3             77.14
24         256        1             77.42
24         256        2             77.64
20         64         4             77.65
24         256        0             78.31
20         256        1             79.21
20         256        4             79.23
20         256        2             79.43
*20*       *256*      *3*           *79.75*
24         256        4             79.81
20         256        0             79.83
32         1024       2             100.93
32         1024       3             100.95
32         1024       0             101.25
32         1024       4             101.28
32         1024       1             101.47
24         1024       2             105.52
24         1024       0             105.62
24         1024       4             106.00
24         1024       3             106.39
24         1024       1             106.53
20         1024       1             111.51
20         1024       2             111.61
20         1024       3             111.69
20         1024       0             111.74
20         1024       4             111.84

The following combinations fail due to insufficient resources:

registers  warp size  optimization  time (s)
42         1024       2             failed
63         1024       4             failed
42         1024       4             failed
63         1024       1             failed
63         1024       2             failed
42         1024       0             failed
42         1024       1             failed
63         1024       0             failed
42         1024       3             failed
63         1024       3             failed

I am going to re-run the analysis with even smaller warp sizes to see what happens.

-Doug
_______________________________________________
Bf-committers mailing list
Bf-committers@blender.org
http://lists.blender.org/mailman/listinfo/bf-committers