Re: std.range.zip performance
On 02/16/2011 03:36 AM, Andrei Alexandrescu wrote: Initial: 58 seconds. Eliminated the switch in popFront: 53s. Replaced emplace with assignment: 23s. Specialized emplace for non-struct types, reinserted: 23s. Eliminated the loop in empty (replaced with return ranges[0].empty;): 17s. I'm sure there are ways to further improve this, but there are a few difficulties. Each pass through the loop the code must transport values from the two arrays into a specific format and then distribute them for further calculation. Then, upon each popFront two words must be touched because arrays hold pointer+length, not pointer+pointer (as probably would be better since ranges are around). Nice analysis! Bearophile, is clay's zip func lazy. Else, it could explain part of its performance, don't you think? Denis
Re: std.range.zip performance
Am 16.02.2011 10:25, schrieb spir: On 02/16/2011 03:36 AM, Andrei Alexandrescu wrote: Initial: 58 seconds. Eliminated the switch in popFront: 53s. Replaced emplace with assignment: 23s. Specialized emplace for non-struct types, reinserted: 23s. Eliminated the loop in empty (replaced with return ranges[0].empty;): 17s. I'm sure there are ways to further improve this, but there are a few difficulties. Each pass through the loop the code must transport values from the two arrays into a specific format and then distribute them for further calculation. Then, upon each popFront two words must be touched because arrays hold pointer+length, not pointer+pointer (as probably would be better since ranges are around). Nice analysis! Bearophile, is clay's zip func lazy. Else, it could explain part of its performance, don't you think? Denis why should it - what can lazyness help here?
Re: std.range.zip performance
Andrei: I'm sure there are ways to further improve this, but there are a few difficulties. Each pass through the loop the code must transport values from the two arrays into a specific format and then distribute them for further calculation. Then, upon each popFront two words must be touched because arrays hold pointer+length, not pointer+pointer (as probably would be better since ranges are around). HOFs as zip/map/filter aren't D built-ins as in Python, but they are basic language constructs. If the D front-end becomes a bit aware of what a zip is, it probably becomes able to optimize away most or all those data movements. The performance difference between the #3, #4, #5 and #5b (without zip constructor) shows there's more space for improvements, optimization-wise. In the Haskell Prelude (a standard module imported and compiled before any user code) shows the implementation of zip(l1,l2) and zip3(l1,l2,l3): zip :: [a] - [b] - [(a,b)] zip = zipWith (,) zip3 :: [a] - [b] - [c] - [(a,b,c)] zip3 = zipWith3 (,,) zipWith :: (a-b-c) - [a]-[b]-[c] zipWith z (a:as) (b:bs) = z a b : zipWith z as bs zipWith _ _ _= [] zipWith3 :: (a-b-c-d) - [a]-[b]-[c]-[d] zipWith3 z (a:as) (b:bs) (c:cs) = z a b c : zipWith3 z as bs cs zipWith3 _ _ _ _ = [] They are tiny compared to Phobos code. They are efficient enough, and they have no corner cases. -- spir: is clay's zip func lazy. It seems lazy, it's not an array of records, and the printing function refuses to print it. Bye, bearophile
Re: std.range.zip performance
I have done one more test, with a much simpler zip(). The code is now faster, but it seems there's no hope to have an acceptable performance. Maybe D needs a built-in zip as Clay. Bye, bearophile --- Timings, seconds, best of 3: #1: 1.95 #2: 61.50 #3: 2.25 #4: 1.52 #5: 15.50 Note: removing the this() the program #5 gets about 2-3 seconds faster. --- // program #5 (D2) import std.c.stdio: printf; struct zip { int[] arr1, arr2; size_t i; static struct Element { int _0, _1; } this(int[] a1, int[] a2) in { assert(a1.length == a2.length); } body { arr1 = a1; arr2 = a2; } bool empty() { return i = arr1.length; } @property Element front() { return Element(arr1[i], arr2[i]); } void popFront() { i++; } } void main() { auto a = [18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28]; auto b = [9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25]; int tot = 0; foreach (i; 0 .. 50_000_000) foreach (xy; zip(a, b)) tot += xy._0 + xy._1; printf(%d\n, tot); } -- Just loops program #5: LCE:mov 024h[ESP],EBX lea EBX,054h[ESP] xor EDX,EDX mov [EBX],EDX mov EAX,014h[ESP] lea ESI,054h[ESP] mov 4[EBX],EDX lea EDI,034h[ESP] mov 8[EBX],EDX mov 0Ch[EBX],EDX mov 010h[EBX],EDX mov EDX,018h[ESP] mov 058h[ESP],EDX mov EDX,020h[ESP] mov 054h[ESP],EAX mov EAX,01Ch[ESP] mov 05Ch[ESP],EAX mov 060h[ESP],EDX movsd movsd movsd movsd movsd mov ECX,044h[ESP] cmp ECX,034h[ESP] jae L167 L11D: mov EBX,044h[ESP] mov EDX,038h[ESP] mov ESI,[EBX*4][EDX] mov EAX,034h[ESP] mov EDX,040h[ESP] mov ECX,[EBX*4][EDX] mov 078h[ESP],ECX mov EAX,03Ch[ESP] mov EDX,078h[ESP] mov 074h[ESP],ESI mov EAX,074h[ESP] mov EBX,074h[ESP] mov 070h[ESP],EDX add EBX,070h[ESP] add EBP,EBX inc dword ptr 044h[ESP] mov ESI,044h[ESP] mov 06Ch[ESP],EAX cmp ESI,034h[ESP] jb L11D L167: mov EBX,024h[ESP] inc EBX cmp EBX,02FAF080h jb LCE --
Re: std.range.zip performance
Initial: 58 seconds. Eliminated the switch in popFront: 53s. Replaced emplace with assignment: 23s. Specialized emplace for non-struct types, reinserted: 23s. Eliminated the loop in empty (replaced with return ranges[0].empty;): 17s. I'm sure there are ways to further improve this, but there are a few difficulties. Each pass through the loop the code must transport values from the two arrays into a specific format and then distribute them for further calculation. Then, upon each popFront two words must be touched because arrays hold pointer+length, not pointer+pointer (as probably would be better since ranges are around). Nice analysis! Andrei On 2/15/11 6:24 PM, bearophile wrote: While doing some benchmarks on the Clay language I've found something interesting. I have timed a zip() done on two short arrays, and I have found this D version too much slow compared to normal array code. Higher order functions like zip, map, filter, etc are going to stay, they aren't toys, so I think the D compiler has to digest them better. (There is lot of of asm at the tail of this post, I don't know if some of t will get cut away.) Bye, bearophile Timings, seconds, best of 3: #1: 1.95 #2: 61.50 #3: 2.25 #4: 1.52 -- Binary sizes, bytes: #1: 33_016 #2: 284_188 #3: 103_964 #4: 103_964 -- Compilers and command line used: DMD 2.051 clay compiler (hg id dec41ba07d58 tip, llvm r122866, Jan 5 2011) clay -inline test1.clay -o test1.exe dmd -O -release -inline test2.d dmd -O -release -inline test3.d To produce asm code with Clay: clay -inline -asm test1.clay -o test1.s Output: -1149672960 for all versions. -- Code used: // program #1 (Clay) main() { var a = Vector([18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28]); var b = Vector([9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25]); var tot = 0; // 32 bit signed int for (i in range(0, 50*1000*1000)) for (x, y in zipped(a, b)) tot += x + y; println(tot); } -- // program #2 (D2) import std.c.stdio: printf; import std.range: zip; void main() { auto a = [18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28]; auto b = [9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25]; int tot = 0; foreach (i; 0 .. 50_000_000) foreach (xy; zip(a, b)) tot += xy[0] + xy[1]; printf(%d\n, tot); } -- // program #3 (D2) import std.c.stdio: printf; void main() { auto a = [18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28]; auto b = [9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25]; int tot = 0; foreach (i; 0 .. 50_000_000) foreach (j; 0 .. a.length) tot += a[j] + b[j]; printf(%d\n, tot); } -- // program #4 (D2) import std.c.stdio: printf; void main() { auto a = [18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28]; auto b = [9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25]; int tot = 0; foreach (i; 0 .. 50_000_000) foreach (j; 0 .. 30) tot += a[j] + b[j]; printf(%d\n, tot); } -- Just loops program #1: movl$5000, %eax # imm = 0x2FAF080 .align16, 0x90 LBB2_4: # %bb.nph.i.i # =This Loop Header: Depth=1 # Child Loop BB2_5 Depth 2 xorl%edx, %edx .align16, 0x90 LBB2_5: # %return410.i.i # Parent Loop BB2_4 Depth=1 # = This Inner Loop Header: Depth=2 addl(%esi,%edx,4), %ecx addl(%edi,%edx,4), %ecx incl%edx cmpl$30, %edx jneLBB2_5 # BB#3: # %return272.loopexit.i.i # in Loop: Header=BB2_4 Depth=1 decl%eax jneLBB2_4 -- Just loops program #2: LCE:pushdword ptr 018h[ESP] movESI,offset FLAT:_D3std5range14__T3ZipTAiTAiZ3Zip6__initZ pushdword ptr 018h[ESP] pushdword ptr 028h[ESP] pushdword ptr 028h[ESP] push0 leaEDI,078h[ESP] movsd movsd movsd movsd movsd leaEAX,078h[ESP] callnear ptr _D3std5range14__T3ZipTAiTAiZ3Zip6__ctorMFNcAiAiE3std5range14StoppingPolicyZS3std5range14__T3ZipTAiTAiZ3Zip movESI,EAX leaEDI,044h[ESP] movsd movsd movsd