Re: std.range.zip performance

2011-02-16 Thread spir

On 02/16/2011 03:36 AM, Andrei Alexandrescu wrote:

Initial: 58 seconds.

Eliminated the switch in popFront: 53s.

Replaced emplace with assignment: 23s.

Specialized emplace for non-struct types, reinserted: 23s.

Eliminated the loop in empty (replaced with return ranges[0].empty;): 17s.

I'm sure there are ways to further improve this, but there are a few
difficulties. Each pass through the loop the code must transport values from
the two arrays into a specific format and then distribute them for further
calculation. Then, upon each popFront two words must be touched because arrays
hold pointer+length, not pointer+pointer (as probably would be better since
ranges are around).


Nice analysis!


Bearophile, is clay's zip func lazy. Else, it could explain part of its 
performance, don't you think?


Denis


Re: std.range.zip performance

2011-02-16 Thread dennis luehring

Am 16.02.2011 10:25, schrieb spir:

On 02/16/2011 03:36 AM, Andrei Alexandrescu wrote:

 Initial: 58 seconds.

 Eliminated the switch in popFront: 53s.

 Replaced emplace with assignment: 23s.

 Specialized emplace for non-struct types, reinserted: 23s.

 Eliminated the loop in empty (replaced with return ranges[0].empty;): 17s.

 I'm sure there are ways to further improve this, but there are a few
 difficulties. Each pass through the loop the code must transport values from
 the two arrays into a specific format and then distribute them for further
 calculation. Then, upon each popFront two words must be touched because arrays
 hold pointer+length, not pointer+pointer (as probably would be better since
 ranges are around).


 Nice analysis!


Bearophile, is clay's zip func lazy. Else, it could explain part of its
performance, don't you think?

Denis


why should it - what can lazyness help here?


Re: std.range.zip performance

2011-02-16 Thread bearophile
Andrei:

 I'm sure there are ways to further improve this, but there are a few 
 difficulties. Each pass through the loop the code must transport values from 
 the two arrays into a specific format and then distribute them for further 
 calculation. Then, upon each popFront two words must be touched because 
 arrays hold pointer+length, not pointer+pointer (as probably would be better 
 since ranges are around).

HOFs as zip/map/filter aren't D built-ins as in Python, but they are basic 
language constructs. If the D front-end becomes a bit aware of what a zip is, 
it probably becomes able to optimize away most or all those data movements.

The performance difference between the #3, #4, #5 and #5b (without zip 
constructor) shows there's more space for improvements, optimization-wise.

In the Haskell Prelude (a standard module imported and compiled before any user 
code) shows the implementation of zip(l1,l2) and zip3(l1,l2,l3):


zip  :: [a] - [b] - [(a,b)]
zip  =  zipWith (,)

zip3 :: [a] - [b] - [c] - [(a,b,c)]
zip3 =  zipWith3 (,,)

zipWith  :: (a-b-c) - [a]-[b]-[c]
zipWith z (a:as) (b:bs)
 =  z a b : zipWith z as bs
zipWith _ _ _=  []

zipWith3 :: (a-b-c-d) - [a]-[b]-[c]-[d]
zipWith3 z (a:as) (b:bs) (c:cs)
 =  z a b c : zipWith3 z as bs cs
zipWith3 _ _ _ _ =  []


They are tiny compared to Phobos code. They are efficient enough, and they have 
no corner cases.

--

spir:

is clay's zip func lazy.

It seems lazy, it's not an array of records, and the printing function refuses 
to print it.

Bye,
bearophile


Re: std.range.zip performance

2011-02-15 Thread bearophile
I have done one more test, with a much simpler zip(). The code is now faster, 
but it seems there's no hope to have an acceptable performance. Maybe D needs a 
built-in zip as Clay.

Bye,
bearophile

---

Timings, seconds, best of 3:
  #1:  1.95
  #2: 61.50
  #3:  2.25
  #4:  1.52
  #5: 15.50

Note: removing the this() the program #5 gets about 2-3 seconds faster.

---

// program #5 (D2)
import std.c.stdio: printf;

struct zip {
int[] arr1, arr2;
size_t i;

static struct Element { int _0, _1; }

this(int[] a1, int[] a2)
in {
assert(a1.length == a2.length);
} body {
arr1 = a1;
arr2 = a2;
}

bool empty() {
return i = arr1.length;
}

@property Element front() {
return Element(arr1[i], arr2[i]);
}

void popFront() {
i++;
}
}

void main() {
auto a = 
[18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28];
auto b = 
[9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25];
int tot = 0;
foreach (i; 0 .. 50_000_000)
foreach (xy; zip(a, b))
tot += xy._0 + xy._1;
printf(%d\n, tot);
}

--

Just loops program #5:

LCE:mov 024h[ESP],EBX
lea EBX,054h[ESP]
xor EDX,EDX
mov [EBX],EDX
mov EAX,014h[ESP]
lea ESI,054h[ESP]
mov 4[EBX],EDX
lea EDI,034h[ESP]
mov 8[EBX],EDX
mov 0Ch[EBX],EDX
mov 010h[EBX],EDX
mov EDX,018h[ESP]
mov 058h[ESP],EDX
mov EDX,020h[ESP]
mov 054h[ESP],EAX
mov EAX,01Ch[ESP]
mov 05Ch[ESP],EAX
mov 060h[ESP],EDX
movsd
movsd
movsd
movsd
movsd
mov ECX,044h[ESP]
cmp ECX,034h[ESP]
jae L167
L11D:   mov EBX,044h[ESP]
mov EDX,038h[ESP]
mov ESI,[EBX*4][EDX]
mov EAX,034h[ESP]
mov EDX,040h[ESP]
mov ECX,[EBX*4][EDX]
mov 078h[ESP],ECX
mov EAX,03Ch[ESP]
mov EDX,078h[ESP]
mov 074h[ESP],ESI
mov EAX,074h[ESP]
mov EBX,074h[ESP]
mov 070h[ESP],EDX
add EBX,070h[ESP]
add EBP,EBX
inc dword ptr 044h[ESP]
mov ESI,044h[ESP]
mov 06Ch[ESP],EAX
cmp ESI,034h[ESP]
jb  L11D
L167:   mov EBX,024h[ESP]
inc EBX
cmp EBX,02FAF080h
jb  LCE

--


Re: std.range.zip performance

2011-02-15 Thread Andrei Alexandrescu

Initial: 58 seconds.

Eliminated the switch in popFront: 53s.

Replaced emplace with assignment: 23s.

Specialized emplace for non-struct types, reinserted: 23s.

Eliminated the loop in empty (replaced with return ranges[0].empty;): 17s.

I'm sure there are ways to further improve this, but there are a few 
difficulties. Each pass through the loop the code must transport values 
from the two arrays into a specific format and then distribute them for 
further calculation. Then, upon each popFront two words must be touched 
because arrays hold pointer+length, not pointer+pointer (as probably 
would be better since ranges are around).



Nice analysis!

Andrei

On 2/15/11 6:24 PM, bearophile wrote:

While doing some benchmarks on the Clay language I've found something 
interesting.

I have timed a zip() done on two short arrays, and I have found this D version 
too much slow compared to normal array code. Higher order functions like zip, 
map, filter, etc are going to stay, they aren't toys, so I think the D compiler 
has to digest them better.

(There is lot of of asm at the tail of this post, I don't know if some of t 
will get cut away.)

Bye,
bearophile


Timings, seconds, best of 3:
   #1:  1.95
   #2: 61.50
   #3:  2.25
   #4:  1.52

--

Binary sizes, bytes:
   #1:  33_016
   #2: 284_188
   #3: 103_964
   #4: 103_964

--

Compilers and command line used:

DMD 2.051
clay compiler (hg id dec41ba07d58 tip, llvm r122866, Jan  5 2011)

clay -inline test1.clay -o test1.exe
dmd -O -release -inline test2.d
dmd -O -release -inline test3.d

To produce asm code with Clay:
clay -inline -asm test1.clay -o test1.s

Output: -1149672960 for all versions.

--

Code used:


// program #1 (Clay)
main() {
 var a = 
Vector([18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28]);
 var b = 
Vector([9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25]);
 var tot = 0; // 32 bit signed int
 for (i in range(0, 50*1000*1000))
 for (x, y in zipped(a, b))
 tot += x + y;
 println(tot);
}

--

// program #2 (D2)
import std.c.stdio: printf;
import std.range: zip;
void main() {
 auto a = 
[18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28];
 auto b = 
[9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25];
 int tot = 0;
 foreach (i; 0 .. 50_000_000)
 foreach (xy; zip(a, b))
 tot += xy[0] + xy[1];
 printf(%d\n, tot);
}

--

// program #3 (D2)
import std.c.stdio: printf;
void main() {
 auto a = 
[18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28];
 auto b = 
[9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25];
 int tot = 0;
 foreach (i; 0 .. 50_000_000)
 foreach (j; 0 .. a.length)
 tot += a[j] + b[j];
 printf(%d\n, tot);
}

--

// program #4 (D2)
import std.c.stdio: printf;
void main() {
 auto a = 
[18,10,17,14,19,23,11,0,6,0,17,25,5,4,19,21,17,13,5,7,11,22,23,17,24,7,11,11,1,28];
 auto b = 
[9,12,1,4,1,18,11,6,5,18,24,15,26,14,24,8,17,26,23,17,3,28,27,0,9,27,0,19,13,25];
 int tot = 0;
 foreach (i; 0 .. 50_000_000)
 foreach (j; 0 .. 30)
 tot += a[j] + b[j];
 printf(%d\n, tot);
}

--

Just loops program #1:

 movl$5000, %eax # imm = 0x2FAF080
 .align16, 0x90
LBB2_4: # %bb.nph.i.i
 # =This Loop Header: Depth=1
 # Child Loop BB2_5 Depth 2
 xorl%edx, %edx
 .align16, 0x90
LBB2_5: # %return410.i.i
 #   Parent Loop BB2_4 Depth=1
 # =   This Inner Loop Header: Depth=2
 addl(%esi,%edx,4), %ecx
 addl(%edi,%edx,4), %ecx
 incl%edx
 cmpl$30, %edx
 jneLBB2_5
# BB#3: # %return272.loopexit.i.i
 #   in Loop: Header=BB2_4 Depth=1
 decl%eax
 jneLBB2_4

--

Just loops program #2:

LCE:pushdword ptr 018h[ESP]
 movESI,offset FLAT:_D3std5range14__T3ZipTAiTAiZ3Zip6__initZ
 pushdword ptr 018h[ESP]
 pushdword ptr 028h[ESP]
 pushdword ptr 028h[ESP]
 push0
 leaEDI,078h[ESP]
 movsd
 movsd
 movsd
 movsd
 movsd
 leaEAX,078h[ESP]
 callnear ptr 
_D3std5range14__T3ZipTAiTAiZ3Zip6__ctorMFNcAiAiE3std5range14StoppingPolicyZS3std5range14__T3ZipTAiTAiZ3Zip
 movESI,EAX
 leaEDI,044h[ESP]
 movsd
 movsd
 movsd