[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2021-04-23 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

Iain Buclaw  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #16 from Iain Buclaw  ---
cfloat_unary_add: 15 secs, 195 ms, 935 μs, and 5 hnsecs
std_cfloat_unary_add: 2 secs, 491 ms, 834 μs, and 9 hnsecs

cfloat_unary_sub: 14 secs, 926 ms, 587 μs, and 6 hnsecs
std_cfloat_unary_sub: 4 secs, 858 ms, 349 μs, and 4 hnsecs

cfloat_binary_add: 22 secs, 363 ms, 951 μs, and 9 hnsecs
std_cfloat_binary_add: 5 secs, 403 ms, 108 μs, and 9 hnsecs

cfloat_binary_sub: 22 secs, 236 ms, and 902 μs
std_cfloat_binary_sub: 5 secs, 266 ms, 697 μs, and 6 hnsecs

cfloat_binary_mul: 24 secs, 858 ms, 63 μs, and 7 hnsecs
std_cfloat_binary_mul: 7 secs, 186 ms, 291 μs, and 8 hnsecs

cfloat_binary_div: 30 secs, 225 ms, 114 μs, and 4 hnsecs
std_cfloat_binary_div: 17 secs, 900 ms, 164 μs, and 6 hnsecs

cfloat_binary_div(FastMath): 29 secs, 230 ms, 821 μs, and 5 hnsecs
std_cfloat_binary_div(FastMath): 12 secs, 208 ms, 118 μs, and 7 hnsecs


cdouble_unary_add: 2 secs, 788 ms, 525 μs, and 6 hnsecs
std_cdouble_unary_add: 2 secs, 922 ms, 224 μs, and 1 hnsec

cdouble_unary_sub: 2 secs, 502 ms, and 734 μs
std_cdouble_unary_sub: 2 secs, 915 ms, 203 μs, and 9 hnsecs

cdouble_binary_add: 2 secs, 869 ms, 820 μs, and 1 hnsec
std_cdouble_binary_add: 3 secs, 108 ms, 545 μs, and 4 hnsecs

cdouble_binary_sub: 2 secs, 836 ms, 796 μs, and 5 hnsecs
std_cdouble_binary_sub: 3 secs, 159 ms, 209 μs, and 3 hnsecs

cdouble_binary_mul: 4 secs, 785 ms, 197 μs, and 6 hnsecs
std_cdouble_binary_mul: 5 secs, 197 ms, 572 μs, and 9 hnsecs

cdouble_binary_div: 14 secs, 238 ms, 332 μs, and 6 hnsecs
std_cdouble_binary_div: 15 secs, 933 ms, 301 μs, and 8 hnsecs

cdouble_binary_div(FastMath): 10 secs, 700 ms, and 32 μs
std_cdouble_binary_div(FastMath): 11 secs, 8 ms, 868 μs, and 5 hnsecs


creal_unary_add: 8 secs, 183 ms, 254 μs, and 3 hnsecs
std_creal_unary_add: 14 secs, 72 ms, 96 μs, and 2 hnsecs

creal_unary_sub: 8 secs, 425 ms, 681 μs, and 9 hnsecs
std_creal_unary_sub: 10 secs, 854 ms, 312 μs, and 8 hnsecs

creal_binary_add: 3 minutes, 50 secs, 877 ms, 637 μs, and 6 hnsecs
std_creal_binary_add: 3 minutes, 57 secs, 397 ms, 952 μs, and 4 hnsecs

creal_binary_sub: 4 minutes, 4 secs, 982 ms, 715 μs, and 2 hnsecs
std_creal_binary_sub: 4 minutes, 11 secs, 485 ms, 74 μs, and 8 hnsecs

creal_binary_mul: 11 minutes, 31 secs, 328 ms, 600 μs, and 7 hnsecs
std_creal_binary_mul: 11 minutes, 46 secs, 26 ms, 451 μs, and 2 hnsecs

creal_binary_div: 20 minutes, 48 secs, 778 ms, and 747 μs
std_creal_binary_div: 20 minutes, 2 secs, 439 ms, and 535 μs

creal_binary_div(FastMath): 18 minutes, 38 secs, 613 ms, 679 μs, and 6 hnsecs
std_creal_binary_div(FastMath): 18 minutes, 42 secs, 400 ms, 343 μs, and 7 hnsecs

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2021-04-16 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

Iain Buclaw  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #15 from Iain Buclaw  ---
Not sure if this should really be marked as resolved/fixed, but anyhow...

With the following (lazy) function generator:
---
import std.complex : C = Complex;
import std.meta : AliasSeq;
import std.format : format;

static foreach (T; AliasSeq!(cfloat, cdouble, creal))
{
    // Unary operators
    mixin(format!"%s %s_unary_add(%s a) { return +a; }"
          (T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_unary_sub(%s a) { return -a; }"
          (T.stringof, T.stringof, T.stringof));

    // Binary operators
    mixin(format!"%s %s_binary_add(%s a, %s b) { return a + b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_sub(%s a, %s b) { return a - b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_mul(%s a, %s b) { return a * b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"%s %s_binary_div(%s a, %s b) { return a / b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
}

static foreach (T; AliasSeq!(float, double, real))
{
    // Unary operators
    mixin(format!"C!%s std_c%s_unary_add(C!%s a) { return +a; }"
          (T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_unary_sub(C!%s a) { return -a; }"
          (T.stringof, T.stringof, T.stringof));

    // Binary operators
    mixin(format!"C!%s std_c%s_binary_add(C!%s a, C!%s b) { return a + b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_sub(C!%s a, C!%s b) { return a - b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_mul(C!%s a, C!%s b) { return a * b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
    mixin(format!"C!%s std_c%s_binary_div(C!%s a, C!%s b) { return a / b; }"
          (T.stringof, T.stringof, T.stringof, T.stringof));
}
---
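
For illustration, each mixin in the generator above expands to one plain
function. A minimal sketch that prints the generated source for a single case;
the literal "float" stands in for T.stringof and is chosen here only as an
example:

```d
import std.format : format;
import std.stdio : writeln;

void main()
{
    // Same format string as the std.complex binary_div mixin above,
    // evaluated at compile time with "float" substituted for T.stringof.
    enum src = format!"C!%s std_c%s_binary_div(C!%s a, C!%s b) { return a / b; }"
        ("float", "float", "float", "float");

    // Prints: C!float std_cfloat_binary_div(C!float a, C!float b) { return a / b; }
    writeln(src);
}
```

The function names in the assembly listings (std_cfloat_binary_div and so on)
come from this expansion.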


On x86_64/GDC, the results are:


cfloat_unary_add:
movq    %xmm0, -8(%rsp)
movss   -8(%rsp), %xmm0
movss   %xmm0, -16(%rsp)
movss   -4(%rsp), %xmm0
movss   %xmm0, -12(%rsp)
movq    -16(%rsp), %xmm0
ret
---
std_cfloat_unary_add:
ret



cdouble_unary_add:
ret
---
std_cdouble_unary_add:
ret



creal_unary_add:
fldt    8(%rsp)
fldt    24(%rsp)
fxch    %st(1)
ret
---
std_creal_unary_add:
movdqa  8(%rsp), %xmm0
movdqa  24(%rsp), %xmm1
movq    %rdi, %rax
movaps  %xmm0, (%rdi)
movaps  %xmm1, 16(%rdi)
ret



cfloat_unary_sub:
movq    %xmm0, -8(%rsp)
movss   -8(%rsp), %xmm0
movss   .LC4(%rip), %xmm2
movaps  %xmm0, %xmm1
movss   -4(%rsp), %xmm0
xorps   %xmm2, %xmm1
xorps   %xmm2, %xmm0
movss   %xmm1, -16(%rsp)
movss   %xmm0, -12(%rsp)
movq    -16(%rsp), %xmm0
ret
.LC4:
.long   -2147483648
.long   0
.long   0
.long   0
---
std_cfloat_unary_sub:
movq    .LC7(%rip), %xmm1
xorps   %xmm1, %xmm0
ret
.LC7:
.long   -2147483648
.long   -2147483648



cdouble_unary_sub:
movq    .LC5(%rip), %xmm2
xorpd   %xmm2, %xmm1
xorpd   %xmm2, %xmm0
ret
.LC5:
.long   0
.long   -2147483648
.long   0
.long   0
---
std_cdouble_unary_sub:
movq    %xmm0, -24(%rsp)
movq    %xmm1, -16(%rsp)
movapd  -24(%rsp), %xmm2
xorpd   .LC8(%rip), %xmm2
movaps  %xmm2, -24(%rsp)
movsd   -16(%rsp), %xmm1
movsd   -24(%rsp), %xmm0
ret
.LC8:
.long   0
.long   -2147483648
.long   0
.long   -2147483648



creal_unary_sub:
fldt    8(%rsp)
fchs
fldt    24(%rsp)
fchs
fxch    %st(1)
ret
---
std_creal_unary_sub:
fldt    24(%rsp)
movq    %rdi, %rax
fchs
fldt    8(%rsp)
fchs
fstpt   (%rdi)
fstpt   16(%rdi)
ret



cfloat_binary_add:
movq    %xmm0, -8(%rsp)
movq    %xmm1, -16(%rsp)
movss   -8(%rsp), %xmm1
movss   -16(%rsp), %xmm0
addss   %xmm0, %xmm1
movss   -12(%rsp), %xmm0
addss   -4(%rsp), %xmm0
movss   %xmm1, 

[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2021-03-24 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #14 from ponce  ---
No problem from here, a lot of our complex code is now SIMD. I doubt we'll see
a practical problem apart from the transition work. It's easy to recreate the
desired division algorithm manually if ever needed.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2021-03-24 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #13 from Iain Buclaw  ---
(In reply to ponce from comment #4)
> RESULTS
> 
> * With ldc 1.8.0 64-bit:
> 
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m64
> $ divide.exe
> 
> With cfloat: 7 secs, 623 ms, 829 μs, and 9 hnsecs
> With cdouble: 7 secs, 594 ms, 449 μs, and 8 hnsecs
> With Complex!float: 7 secs, 988 ms, 642 μs, and 4 hnsecs
> With Complex!double: 15 secs, 501 ms, 128 μs, and 4 hnsecs
> 
> 
> * With ldc 1.8.0 32-bit:
> 
> $ ldc2.exe -O3 -enable-inlining -release divide.d -m32
> $ divide.exe
> 
> With cfloat: 7 secs, 618 ms, 202 μs, and 1 hnsec
> With cdouble: 7 secs, 593 ms, 777 μs, and 2 hnsecs
> With Complex!float: 7 secs, 958 ms, 692 μs, and 9 hnsecs
> With Complex!double: 15 secs, 414 ms, and 344 μs
> 
> 
> This shows that even with the latest LDC you can have a regression.
> 
> I appreciate that std.complex gives more precision in the divide operation,
> but it's also something that is _different_ from the builtin complex it replaces.

A bug should probably be raised against LDC for not using range reduction (i.e.,
Smith's algorithm) in its native complex division implementation.

The slowdown is not a regression; LDC is just using the wrong algorithm by
default (i.e., the "fast" naive version should be generated only when compiling
with `-ffast-math`).

GDC and LDC could coordinate with each other and predefine `version(FastMath)`
when the `-ffast-math` flag is given on the command-line.
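
A minimal sketch of what such a switch could look like, assuming a
`version(FastMath)` identifier is predefined as suggested above. The function
name `divide` is hypothetical and this is the textbook form of Smith's
algorithm, not the code Phobos or any compiler actually uses:

```d
import std.complex : Complex;
import std.math : fabs;
import std.stdio : writeln;

// Hypothetical helper (not a Phobos function): computes x / y.
Complex!double divide(Complex!double x, Complex!double y)
{
    version (FastMath)
    {
        // Naive formula: fewer operations, but y.re^2 + y.im^2 can
        // overflow or underflow even when the true quotient is representable.
        double den = y.re * y.re + y.im * y.im;
        return Complex!double((x.re * y.re + x.im * y.im) / den,
                              (x.im * y.re - x.re * y.im) / den);
    }
    else
    {
        // Smith's algorithm: scale by the larger component of the divisor
        // so the intermediate ratio r always satisfies |r| <= 1.
        if (fabs(y.re) >= fabs(y.im))
        {
            double r = y.im / y.re;
            double den = y.re + y.im * r;
            return Complex!double((x.re + x.im * r) / den,
                                  (x.im - x.re * r) / den);
        }
        else
        {
            double r = y.re / y.im;
            double den = y.re * r + y.im;
            return Complex!double((x.re * r + x.im) / den,
                                  (x.im * r - x.re) / den);
        }
    }
}

void main()
{
    // Squaring 1e200 overflows a double, so the naive branch cannot
    // produce a finite result here; the Smith branch stays in range.
    auto big = Complex!double(1e200, 1e200);
    writeln(divide(big, big));
}
```

The extra comparison and branch in the default path are the cost of the
range reduction; the `FastMath` path trades that robustness for speed.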

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2021-03-22 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

Dlang Bot  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Dlang Bot  ---
dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes algorithm for
complex divide" was merged into master:

- 70595f5d51011a6258d001523c8749411b9d8152 by Iain Buclaw:
  fix Issue 18627 - Use cephes algorithm for complex divide

https://github.com/dlang/phobos/pull/7814

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2021-02-27 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

Dlang Bot  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |pull

--- Comment #11 from Dlang Bot  ---
@ibuclaw created dlang/phobos pull request #7814 "fix Issue 18627 - Use cephes
algorithm for complex divide" fixing this issue:

- fix Issue 18627 - Use cephes algorithm for complex divide

https://github.com/dlang/phobos/pull/7814

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2020-07-05 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #10 from Iain Buclaw  ---
(In reply to ponce from comment #4)
> This benchmark is a variation that does only division.
> 
> --- divide.d

* With gdc -O2 -frelease -m64

With cfloat: 11 secs, 204 ms, 475 μs, and 2 hnsecs
With cdouble: 13 secs, 420 ms, 497 μs, and 6 hnsecs
With Complex!float: 4 secs, 689 ms, 546 μs, and 2 hnsecs
With Complex!double: 8 secs, 903 ms, 172 μs, and 4 hnsecs

* With gdc -O2 -frelease -m32

With cfloat: 29 secs, 471 ms, 678 μs, and 9 hnsecs
With cdouble: 29 secs, 176 ms, 189 μs, and 2 hnsecs
With Complex!float: 13 secs, 379 ms, 856 μs, and 8 hnsecs
With Complex!double: 18 secs, 240 ms, 975 μs, and 5 hnsecs

Native complex floating point must die.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-22 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #9 from ponce  ---
I think at the very least std.complex should contain a function to divide
Complex values without the additional precision provided by the check with the
two fabs() calls.

People who want speed could opt in, and others will enjoy increased precision
without noticing.
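
A rough sketch of what such an opt-in function might look like. The name
`fastDivide` is hypothetical and nothing like it is confirmed to exist in
std.complex; it simply applies the textbook formula with no fabs() comparison:

```d
import std.complex : Complex;
import std.math : fabs;

// Hypothetical opt-in helper: divides x by y with the textbook formula.
// Faster than a range-reduced division, but the squared terms in the
// denominator can overflow or underflow for extreme divisors.
Complex!T fastDivide(T)(Complex!T x, Complex!T y)
{
    T den = y.re * y.re + y.im * y.im;
    return Complex!T((x.re * y.re + x.im * y.im) / den,
                     (x.im * y.re - x.re * y.im) / den);
}

void main()
{
    // (1 + 2i) / (3 + 4i) = 0.44 + 0.08i
    auto q = fastDivide(Complex!double(1, 2), Complex!double(3, 4));
    assert(fabs(q.re - 0.44) < 1e-12);
    assert(fabs(q.im - 0.08) < 1e-12);
}
```

Callers that need speed would use the helper explicitly, while the `/`
operator would keep the more precise behavior.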

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-20 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #8 from ponce  ---
@Seb: It's not only about DMD, there is a 2x performance regression with
Complex!double vs cdouble using LDC. There are probably more I haven't exposed
yet. And yes, I use cdouble for designing IIR filters, in a real-time program.

Our main product uses builtin complexes, and it's downloaded 2000 times per month.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-20 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

Seb  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |greensunn...@gmail.com

--- Comment #7 from Seb  ---
> Division with DMD 32-bit:

Using DMD for any performance argument is a bit of a moot point, as DMD's
optimizer is pretty bad. Treating every DMD slowdown as a blocker would halt
almost all development, as there are many, many performance regressions with DMD.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-20 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #6 from ponce  ---
Conversely, with DMD, complex division seems faster with std.complex than with
the builtins.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-20 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #5 from ponce  ---
Division with DMD 32-bit:

With cfloat: 1 minute, 18 secs, 451 ms, 932 μs, and 9 hnsecs
With cdouble: 1 minute, 19 secs, 747 ms, 70 μs, and 5 hnsecs
With Complex!float: 27 secs, 412 ms, 926 μs, and 5 hnsecs
With Complex!double: 25 secs, 39 ms, 159 μs, and 2 hnsecs

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-20 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #4 from ponce  ---
This benchmark is a variation that does only division.

--- divide.d

import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;

void main()
{
    int[] divider = new int[1024];
    cfloat[] A = new cfloat[1024];
    cdouble[] B = new cdouble[1024];
    Complex!float[] C = new Complex!float[1024];
    Complex!double[] D = new Complex!double[1024];
    foreach(i; 0..1024)
    {
        divider[i] = (i*69060) / 1;
        // Initialize with something
        A[i] = i + 1i;
        B[i] = i + 1i;
        C[i] = Complex!float(i, 1);
        D[i] = Complex!double(i, 1);
    }

    void justDivide(ComplexType)(ComplexType[] arr)
    {
        int size = cast(int)(arr.length);
        for (int i = 0; i < size; ++i)
        {
            arr[i] = divider[i] / arr[i];
        }
    }

    void fA()
    {
        justDivide!(cfloat)(A);
    }

    void fB()
    {
        justDivide!(cdouble)(B);
    }

    void fC()
    {
        justDivide!(Complex!float)(C);
    }

    void fD()
    {
        justDivide!(Complex!double)(D);
    }

    auto r = benchmark!(fA, fB, fC, fD)(100);

    {
        writefln("With cfloat: %s", r[0] );
        writefln("With cdouble: %s", r[1] );
        writefln("With Complex!float: %s", r[2] );
        writefln("With Complex!double: %s", r[3] );
    }
}

RESULTS

* With ldc 1.8.0 64-bit:

$ ldc2.exe -O3 -enable-inlining -release divide.d -m64
$ divide.exe

With cfloat: 7 secs, 623 ms, 829 μs, and 9 hnsecs
With cdouble: 7 secs, 594 ms, 449 μs, and 8 hnsecs
With Complex!float: 7 secs, 988 ms, 642 μs, and 4 hnsecs
With Complex!double: 15 secs, 501 ms, 128 μs, and 4 hnsecs


* With ldc 1.8.0 32-bit:

$ ldc2.exe -O3 -enable-inlining -release divide.d -m32
$ divide.exe

With cfloat: 7 secs, 618 ms, 202 μs, and 1 hnsec
With cdouble: 7 secs, 593 ms, 777 μs, and 2 hnsecs
With Complex!float: 7 secs, 958 ms, 692 μs, and 9 hnsecs
With Complex!double: 15 secs, 414 ms, and 344 μs


This shows that even with the latest LDC you can have a regression.

I appreciate that std.complex gives more precision in the divide operation,
but it's also something that is _different_ from the builtin complex it replaces.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-20 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

Iain Buclaw  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ibuc...@gdcproject.org

--- Comment #3 from Iain Buclaw  ---
FYI, GDC is missing from the comparison, so I'll post its results here, along
with DMD as a comparative benchmark, because each machine is different and DMD
may optimize weirdly for one CPU but perfectly fine for another (see for
instance issue 5100).


DMD64 D Compiler v2.076.1
---
$ dmd complex.d -O -inline -release
With cfloat: 75 ms, 688 μs, and 2 hnsecs
With cdouble: 61 ms, 546 μs, and 7 hnsecs
With Complex!float: 161 ms, 816 μs, and 8 hnsecs
With Complex!double: 109 ms, 66 μs, and 1 hnsec
---

There seems to be room for improvement in dmd or the general phobos
implementation.


gdc (GCC) 8.0.1 20180226 (2.076.1 library and patches)
---
$ gdc complex.d -O2 -frelease
With cfloat: 154 ms, 871 μs, and 8 hnsecs
With cdouble: 59 ms, 205 μs, and 7 hnsecs
With Complex!float: 32 ms, 566 μs, and 5 hnsecs
With Complex!double: 34 ms, 961 μs, and 6 hnsecs
---

However with gdc, std.complex is /faster/ than native.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-18 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

--- Comment #2 from ponce  ---
I've posted there, thanks.

--


[Issue 18627] std.complex is a lot slower than builtin complex types at number crunching

2018-03-18 Thread d-bugmail--- via Digitalmars-d-bugs
https://issues.dlang.org/show_bug.cgi?id=18627

greenify  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |greeen...@gmail.com

--- Comment #1 from greenify  ---
See also: https://github.com/dlang/dmd/pull/7640

--