Re: Speed kills

2016-03-09 Thread Jon D via Digitalmars-d

On Wednesday, 9 March 2016 at 20:30:10 UTC, Jon D wrote:


I've seen a few cases while exploring D.

Turns out there are issues filed for each of the performance 
issues I mentioned:


* Lower casing strings:  
https://issues.dlang.org/show_bug.cgi?id=11229
* Large associative arrays:  
https://issues.dlang.org/show_bug.cgi?id=2504
* Associative arrays - Checking membership with mutable values 
(char arrays) rather than strings (immutable):  
https://issues.dlang.org/show_bug.cgi?id=15038






Re: Speed kills

2016-03-09 Thread John Colvin via Digitalmars-d

On Wednesday, 9 March 2016 at 21:01:13 UTC, H. S. Teoh wrote:
On Wed, Mar 09, 2016 at 08:30:10PM +, Jon D via 
Digitalmars-d wrote:

On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
>[...]


In the case of std.algorithm.sum, the focus is on accuracy 
rather than performance.  It does some extra work to ensure 
maximum accuracy in the result, so it shouldn't be expected to 
have top performance.  Granted, though, the docs could be 
clearer about this accuracy vs. performance tradeoff.  Please 
file a bug on this (or better yet, submit a PR for it).


In any case, 4 times slower sounds a bit excessive... it would 
be good to investigate why this is happening and fix it.


I think you'd be good at reviewing this: 
https://github.com/D-Programming-Language/phobos/pull/4069




Re: Speed kills

2016-03-09 Thread H. S. Teoh via Digitalmars-d
On Wed, Mar 09, 2016 at 08:30:10PM +, Jon D via Digitalmars-d wrote:
> On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
> >Since I posted this thread I've learned std.algorithm.sum is 4 times
> >slower than a naive loop sum. Even if this is for reasons of accuracy
> >this is exactly what I am talking about- this is a hidden iceberg of
> >terrible performance that will reflect poorly on D. That's so slow
> >the function needs a health warning.

In the case of std.algorithm.sum, the focus is on accuracy rather than
performance.  It does some extra work to ensure maximum accuracy in the
result, so it shouldn't be expected to have top performance.  Granted,
though, the docs could be clearer about this accuracy vs. performance
tradeoff.  Please file a bug on this (or better yet, submit a PR for
it).

In any case, 4 times slower sounds a bit excessive... it would be good
to investigate why this is happening and fix it.


> I've seen a few cases while exploring D. Not all fully researched
> (apologies for that), but since there appears to be interest in
> identification I'll list them.

One of the big performance killers that people are likely to run into is
autodecoding for strings in all range-based algorithms.  Walter has
repeatedly complained about this, but so far Andrei (and some others)
have resisted getting rid of autodecoding.  Fortunately, you can
(mostly) suppress this by wrapping your strings in
std.encoding.codeUnits before operating on them. Nevertheless, it's one
of those hidden gotchas that could result in poorer performance than
what you could get otherwise.
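For illustration, here is a small sketch of the autodecoding gotcha using std.utf.byCodeUnit (a related facility for bypassing autodecoding; availability depends on your Phobos version, and it is not the exact std.encoding wrapper mentioned above):

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "hëllo";                    // 'ë' is two bytes in UTF-8
    assert(s.length == 6);                 // .length counts code units (bytes)
    assert(s.walkLength == 5);             // range iteration autodecodes: 5 code points
    assert(s.byCodeUnit.walkLength == 6);  // byCodeUnit skips the decode step entirely
}
```

Any range algorithm applied to `s.byCodeUnit` operates on raw chars, avoiding the per-element decode cost.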

Another big performance killer that I repeatedly run into is the
overly-eager GC collection schedule.  Some may argue against the use of
the GC, period, but I have found that even with the GC, it's possible to
get 30-40% speedups just by calling GC.disable() and manually triggering
GC.collect() at a lower frequency than the default. This is especially
true when your program performs many allocations of small objects, like
(small) strings. I have observed this in at least 2-3 different
memory-intensive programs of mine, including a recent experiment in
writing a faster CSV parser than std.csv.  Arguably, we should fix this
in druntime itself so that collection cycles are run less often, though
I'm not sure what implications this may have on low-memory systems.
Sometimes, you don't even need to do this for the entire program; you
can get big performance boosts just by wrapping GC.disable() /
GC.enable() around strategic sections of allocation-intensive code.
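A minimal sketch of the pattern described above (the function, input, and collection threshold here are illustrative, not from the post):

```d
import core.memory : GC;

// Hypothetical allocation-heavy loop that collects on its own,
// sparser schedule instead of the default one.
size_t processAll(const char[][] lines)
{
    GC.disable();                // suspend automatic collections
    scope(exit) GC.enable();     // restore the default behavior on exit

    size_t processed;
    foreach (line; lines)
    {
        auto copy = line.idup;   // stands in for many small allocations
        ++processed;
        if (processed % 1_000_000 == 0)
            GC.collect();        // trigger a collection at a lower frequency
    }
    return processed;
}

void main()
{
    assert(processAll(["a", "bb", "ccc"]) == 3);
}
```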


> * Lower-casing strings (likely upper-casing), and some character type
> checks.
> 
> Routines like toLower and asLowerCase call functions that work for
> all unicode characters. I was able to create much faster versions by
> checking if the character was ascii, then calling either the ascii
> version or the more general version. Same is true for a few routines
> like isNumber. Some have the ascii check optimization built in, but
> not all.
> 
> If this optimization is added, it might also be useful to add a few
> common combinations (or a generic solution, if that's feasible). For
> example, to check if a character is alpha-numeric, one currently ORs
> two tests from the standard library, isAlpha and isNumber. Putting in
> an ascii optimization check requires putting it before doing the OR,
> rather than inside the tests being ORed.

These sound like valuable additions to Phobos. Submit a PR, please?


[...]
> * Associative arrays - Converting keys to immutable versions for lookup
> 
> Associative arrays want immutable values as keys. As far as I can tell,
> immutable values are also required when performing a lookup, even if a
> new entry won't be stored. A couple apps I've written walk through
> large lists of text values, naturally available as char[] because they
> are read from input streams. To test presence in an associative array,
> it's necessary to copy them to immutable strings first. I haven't
> fully researched this one, but my experience is that copying from
> char[] to string becomes a meaningful cost.
[...]

This is arguably a bug.  If you're only looking up a key, you should be
able to pass in const(char)[] rather than immutable(char)[] (aka
string).  The GC performance problem I mentioned above is especially
noticeable when there are large numbers of small allocations, as would
be the case if you constantly have to call .idup just because AA lookups
demand immutable keys.  Please file an issue for this, if there isn't
one already filed.


T

-- 
Windows 95 was a joke, and Windows 98 was the punchline.


Re: Speed kills

2016-03-09 Thread Jon D via Digitalmars-d

On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
Since I posted this thread I've learned std.algorithm.sum is 4 
times slower than a naive loop sum. Even if this is for reasons 
of accuracy this is exactly what I am talking about- this is a 
hidden iceberg of terrible performance that will reflect poorly 
on D. That's so slow the function needs a health warning.


I've seen a few cases while exploring D. Not all fully researched 
(apologies for that), but since there appears to be interest in 
identification I'll list them.


* Lower-casing strings (likely upper-casing), and some character 
type checks.


Routines like toLower and asLowerCase call functions that work 
for all unicode characters. I was able to create much faster 
versions by checking if the character was ascii, then calling 
either the ascii version or the more general version. Same is 
true for a few routines like isNumber. Some have the ascii check 
optimization built in, but not all.


If this optimization is added, it might also be useful to add a 
few common combinations (or a generic solution, if that's 
feasible). For example, to check if a character is alpha-numeric, 
one currently ORs two tests from the standard library, isAlpha 
and isNumber. Putting in an ascii optimization check requires 
putting it before doing the OR, rather than inside the tests 
being ORed.
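A sketch of the ASCII-first dispatch being described (fastToLower is a hypothetical name for illustration, not a Phobos function):

```d
import std.ascii : isASCII;
import std.ascii : asciiLower = toLower;  // cheap table-free ASCII version
import std.uni : uniLower = toLower;      // full Unicode version

// Check the cheap ASCII case first; fall back to full Unicode.
dchar fastToLower(dchar c)
{
    return c.isASCII ? asciiLower(c) : uniLower(c);
}

void main()
{
    assert(fastToLower('A') == 'a');   // ASCII fast path
    assert(fastToLower('Á') == 'á');   // general Unicode path
    assert(fastToLower('1') == '1');   // non-letters pass through
}
```

The same shape works for an alpha-numeric check: test isASCII once, then branch, rather than ORing two independently-optimized predicates.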


* Large associative arrays

When associative arrays get beyond about 10 million entries 
performance starts to decline. I believe this is due to resizing 
the arrays. It's worse with strings as keys than integers as 
keys. Having a way to reserve capacity may help under some 
circumstances.


* Associative arrays - Converting keys to immutable versions for 
lookup


Associative arrays want immutable values as keys. As far as I 
can tell, immutable values are also required when performing a 
lookup, even if a new entry won't be stored. A couple apps I've 
written walk through large lists of text values, naturally 
available as char[] because they are read from input streams. To 
test presence in an associative array, it's necessary to copy 
them to immutable strings first. I haven't fully researched this 
one, but my experience is that copying from char[] to string 
becomes a meaningful cost.


On the surface, this appears to be an optimization opportunity, 
to create the immutable strings only when actually storing a new 
value.
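A minimal sketch of the copy-before-lookup cost being described (whether a plain char[] key is accepted for lookup varies across compiler/runtime versions):

```d
void main()
{
    bool[string] seen = ["apple": true];

    char[] buf = "apple".dup;   // mutable data, e.g. read from an input stream

    // The pattern described above: allocate an immutable copy
    // just to perform the membership test.
    string key = buf.idup;      // this per-lookup allocation is the cost
    assert((key in seen) !is null);
}
```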


--Jon


Re: Speed kills

2016-03-09 Thread jmh530 via Digitalmars-d

On Wednesday, 9 March 2016 at 13:42:40 UTC, cym13 wrote:


They just don't do the same thing, sum() uses pairwise 
summation which is safer as I understand it. Corresponding 
issue: https://issues.dlang.org/show_bug.cgi?id=15722


That third comment about how it's not obvious which algorithm sum 
is using is a good one.


Re: Speed kills

2016-03-09 Thread John Colvin via Digitalmars-d
On Wednesday, 9 March 2016 at 14:04:40 UTC, Andrei Alexandrescu 
wrote:

On 3/9/16 9:03 AM, John Colvin wrote:
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei 
Alexandrescu wrote:

On 03/08/2016 09:14 AM, ixid wrote:

[...]


Whoa. What's happening there? Do we have anyone on it? -- 
Andrei


Ilya has long term plans for this, but I have a short-term fix 
that will buy us a very large performance improvement here (if my 
old benchmarks were correct). Give me a few mins to prep the pull 
request :)


Thanks much! -- Andrei


https://github.com/D-Programming-Language/phobos/pull/4069


Re: Speed kills

2016-03-09 Thread Andrei Alexandrescu via Digitalmars-d

On 3/9/16 9:03 AM, John Colvin wrote:

On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu wrote:

On 03/08/2016 09:14 AM, ixid wrote:

Since I posted this thread I've learned std.algorithm.sum is 4 times
slower than a naive loop sum. Even if this is for reasons of accuracy
this is exactly what I am talking about- this is a hidden iceberg of
terrible performance that will reflect poorly on D. That's so slow the
function needs a health warning.


Whoa. What's happening there? Do we have anyone on it? -- Andrei


Ilya has long term plans for this, but I have a short-term fix that will
buy us a very large performance improvement here (if my old benchmarks
were correct). Give me a few mins to prep the pull request :)


Thanks much! -- Andrei


Re: Speed kills

2016-03-09 Thread John Colvin via Digitalmars-d
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu 
wrote:

On 03/08/2016 09:14 AM, ixid wrote:
Since I posted this thread I've learned std.algorithm.sum is 4 
times slower than a naive loop sum. Even if this is for reasons 
of accuracy this is exactly what I am talking about - this is a 
hidden iceberg of terrible performance that will reflect poorly 
on D. That's so slow the function needs a health warning.


Whoa. What's happening there? Do we have anyone on it? -- Andrei


Ilya has long term plans for this, but I have a short-term fix 
that will buy us a very large performance improvement here (if my 
old benchmarks were correct). Give me a few mins to prep the pull 
request :)


Re: Speed kills

2016-03-09 Thread cym13 via Digitalmars-d
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu 
wrote:

On 03/08/2016 09:14 AM, ixid wrote:
Since I posted this thread I've learned std.algorithm.sum is 4 
times slower than a naive loop sum. Even if this is for reasons 
of accuracy this is exactly what I am talking about - this is a 
hidden iceberg of terrible performance that will reflect poorly 
on D. That's so slow the function needs a health warning.


Whoa. What's happening there? Do we have anyone on it? -- Andrei


They just don't do the same thing, sum() uses pairwise summation 
which is safer as I understand it. Corresponding issue: 
https://issues.dlang.org/show_bug.cgi?id=15722
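For reference, a sketch of the difference between the two approaches (a simplified pairwise sum for illustration, not Phobos's actual implementation):

```d
// Naive left-to-right sum: one pass, but rounding error grows O(n).
double naiveSum(const(double)[] xs)
{
    double s = 0;
    foreach (x; xs)
        s += x;
    return s;
}

// Pairwise sum: recursively sums halves, keeping rounding error
// closer to O(log n) at the cost of extra bookkeeping.
double pairwiseSum(const(double)[] xs)
{
    if (xs.length <= 8)            // small slices summed naively
        return naiveSum(xs);
    immutable mid = xs.length / 2;
    return pairwiseSum(xs[0 .. mid]) + pairwiseSum(xs[mid .. $]);
}

void main()
{
    auto data = [1.0, 2.0, 3.0, 4.0, 5.0];
    assert(naiveSum(data) == 15.0);
    assert(pairwiseSum(data) == 15.0);
}
```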


Re: Speed kills

2016-03-09 Thread Daniel Kozak via Digitalmars-d



On 9.3.2016 at 14:26, Andrei Alexandrescu via Digitalmars-d wrote:

On 03/08/2016 09:14 AM, ixid wrote:

Since I posted this thread I've learned std.algorithm.sum is 4 times
slower than a naive loop sum. Even if this is for reasons of accuracy
this is exactly what I am talking about- this is a hidden iceberg of
terrible performance that will reflect poorly on D. That's so slow the
function needs a health warning.


Whoa. What's happening there? Do we have anyone on it? -- Andrei


I guess he's talking about this one:

http://forum.dlang.org/post/mailman.4748.1456070484.22025.digitalmars-d-le...@puremagic.com


Re: Speed kills

2016-03-09 Thread Andrei Alexandrescu via Digitalmars-d

On 03/08/2016 09:14 AM, ixid wrote:

Since I posted this thread I've learned std.algorithm.sum is 4 times
slower than a naive loop sum. Even if this is for reasons of accuracy
this is exactly what I am talking about- this is a hidden iceberg of
terrible performance that will reflect poorly on D. That's so slow the
function needs a health warning.


Whoa. What's happening there? Do we have anyone on it? -- Andrei



Re: Speed kills

2016-03-08 Thread jmh530 via Digitalmars-d

On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:


Since I posted this thread I've learned std.algorithm.sum is 4 
times slower than a naive loop sum. Even if this is for reasons 
of accuracy this is exactly what I am talking about- this is a 
hidden iceberg of terrible performance that will reflect poorly 
on D. That's so slow the function needs a health warning.


There was a longer discussion here
https://forum.dlang.org/post/vkiwojmfjrwhigbke...@forum.dlang.org



Re: Speed kills

2016-03-08 Thread ixid via Digitalmars-d

On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
Every time there is a D thread on reddit it feels like the new 
user is expecting mind-blowing speed from D.


https://www.reddit.com/r/programming/comments/45v03g/porterstemmerd_an_implementation_of_the_porter/

This is the most recent one where John Colvin provided some 
pointers to speed it up significantly. Walter has done some 
good work taking the low-hanging fruit to speed up DMD code and 
there is a lot of effort going on with reference counting 
machinery but I wondered if some of the common errors people 
make that slow down D code can be addressed?


Literals used to be a hidden speed bump, but I think that was 
improved; now the append operator is one of the most common 
culprits. Can this not be enhanced behind the scenes to work 
more like append? Do others notice common pitfalls between the 
article code and what the D community then suggests, where we 
can bridge the gap so naive users get faster code?


Since I posted this thread I've learned std.algorithm.sum is 4 
times slower than a naive loop sum. Even if this is for reasons 
of accuracy this is exactly what I am talking about- this is a 
hidden iceberg of terrible performance that will reflect poorly 
on D. That's so slow the function needs a health warning.


Re: Speed kills

2016-03-07 Thread Marco Leise via Digitalmars-d
Am Wed, 17 Feb 2016 19:55:08 +
schrieb Basile B. :

> Also, forgot to say, but a uniform API is needed to set the 
> rounding mode, whether SSE is used or the FPU...

At least GCC has a codegen switch for that. A solution would
have to either set both rounding modes at once or the
compilers would need to expose version MathFPU/MathSSE.

-- 
Marco



Re: Speed kills

2016-02-21 Thread Luc J. Bourhis via Digitalmars-d

On Wednesday, 17 February 2016 at 19:01:38 UTC, Basile B. wrote:
The oldest 32 bit processor (X86) doesn't support SSE, maybe 
MMX (not sure). So when we do "cast(int) 0.1;" on X86, the 
backend always generates FPU instructions.


SSE goes back to the Pentium III, doesn't it? And the Pentium 4 
supported SSE3, didn't it? Is it an active requirement that D run 
on, e.g., a Pentium II?


Re: Speed kills

2016-02-21 Thread w0rp via Digitalmars-d
I think it's important that DMD gets more of the easier 
optimisations. Most new users won't bother trying GDC or LDC, and 
if DMD doesn't generate fast enough code, they might leave before 
they try the compilers with better optimisations.


Re: Speed kills

2016-02-17 Thread Basile B. via Digitalmars-d

On Wednesday, 17 February 2016 at 19:01:38 UTC, Basile B. wrote:

That's more subtle than that.

The oldest 64 bit processor (AMD64) supports SSE, always. So 
when we do "cast(int) 0.1;" on X86_64, the backend always 
generates SSE instructions.


The oldest 32 bit processor (X86) doesn't support SSE, maybe 
MMX (not sure). So when we do "cast(int) 0.1;" on X86, the 
backend always generates FPU instructions.


This is how I understand the post I've landed on.
My current work always uses SSE, so it doesn't conform to the "at 
least available" feature.


Also, forgot to say, but a uniform API is needed to set the 
rounding mode, whether SSE is used or the FPU...


Re: Speed kills

2016-02-17 Thread Basile B. via Digitalmars-d
On Wednesday, 17 February 2016 at 18:50:45 UTC, Jack Stouffer 
wrote:

On Wednesday, 17 February 2016 at 18:26:47 UTC, Basile B. wrote:
Anyway, why is it not good for Phobos? While looking for 
documentation last night I landed on a post by Walter who 
explained that the library for a systems programming language 
shouldn't be specific to an architecture.


While I don't know about the post you're talking about, I don't 
think what Walter says applies to internal version blocks in a 
function. You could make it so on AMD lround and friends are 
much faster by using those ASM routines. Also, std.math is 
already chock full of architecture specific code.


That's more subtle than that.

The oldest 64 bit processor (AMD64) supports SSE, always. So when 
we do "cast(int) 0.1;" on X86_64, the backend always generate SSE 
instructions.


The oldest 32 bit processor (X86) doesn't support SSE, maybe MMX 
(not sure). So when we do "cast(int) 0.1;" on X86, the backend 
always generates FPU instructions.


This is how I understand the post I've landed on.
My current work always uses SSE, so it doesn't conform to the "at 
least available" feature.


Re: Speed kills

2016-02-17 Thread Jack Stouffer via Digitalmars-d

On Wednesday, 17 February 2016 at 18:26:47 UTC, Basile B. wrote:
Anyway, why is it not good for Phobos? While looking for 
documentation last night I landed on a post by Walter who 
explained that the library for a systems programming language 
shouldn't be specific to an architecture.


While I don't know about the post you're talking about, I don't 
think what Walter says applies to internal version blocks in a 
function. You could make it so on AMD lround and friends are much 
faster by using those ASM routines. Also, std.math is already 
chock full of architecture specific code.


Re: Speed kills

2016-02-17 Thread Basile B. via Digitalmars-d

On Monday, 15 February 2016 at 23:13:13 UTC, Jack Stouffer wrote:

On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:
we could get ceil/trunc/round/floor, also almost as easily 
fmod and hypot.

Classic, but I don't get why they're not in std.math.


Seems like you know a lot about the subject, and I know you 
contributed to phobos before, so how about making a PR for this 
:)


In the meantime:

https://github.com/BBasile/iz/blob/master/import/iz/math.d

Actually, when I participated in this conversation I didn't 
remember that it was not good on X86. Using SSE rounding is really 
only good on AMD64; otherwise loading the input parameter "sucks" 
a lot (even for a 32-bit float, since it's not directly in EAX or 
XMM0).


Anyway, why is it not good for Phobos? While looking for 
documentation last night I landed on a post by Walter who explained 
that the library for a systems programming language shouldn't be 
specific to an architecture.





Re: Speed kills

2016-02-16 Thread Guillaume Piolat via Digitalmars-d

On Monday, 15 February 2016 at 23:35:54 UTC, Basile B. wrote:
On Monday, 15 February 2016 at 23:19:44 UTC, Guillaume Piolat 
wrote:


lround and friends have been a big performance problem at 
times.

Every time you can use cast(int) instead, it's way faster.


I didn't know this trick. It generates almost the same SSE 
instruction (it truncates) and has the advantage of being 
inline-able.


Is it documented somewhere? If not, it should be.


In SSE3 you also get an instruction that does this without 
messing with the x87 control word: FISTTP.


Re: Speed kills

2016-02-15 Thread Basile B. via Digitalmars-d
On Monday, 15 February 2016 at 23:19:44 UTC, Guillaume Piolat 
wrote:


lround and friends have been a big performance problem at times.
Every time you can use cast(int) instead, it's way faster.


I didn't know this trick. It generates almost the same SSE 
instruction (it truncates) and has the advantage of being inline-able.


Is it documented somewhere? If not, it should be.




Re: Speed kills

2016-02-15 Thread Guillaume Piolat via Digitalmars-d

On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:

Same for std.math.lround

they use the FPU path, while for float and double it's only one 
SSE instruction. Typically with 6 functions similar to this one:



int round(float value)
{
    asm
    {
        naked;                // no prologue/epilogue
        cvtss2si EAX, XMM0;   // float in XMM0 -> int32 in EAX
        ret;
    }
}

we could get ceil/trunc/round/floor, also almost as easily 
fmod and hypot.

Classic, but I don't get why they're not in std.math.

Goddamnit, we're in 2016.


lround and friends have been a big performance problem at times.
Every time you can use cast(int) instead, it's way faster.
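Note the semantic difference as well as the speed one; a quick sketch (per the standard definitions: cast(int) truncates toward zero, lround rounds to nearest):

```d
import std.math : lround;

void main()
{
    double x = 3.7;
    assert(cast(int) x == 3);        // cast truncates toward zero
    assert(lround(x) == 4);          // lround rounds to nearest
    assert(cast(int)(-3.7) == -3);   // truncation, not floor
    assert(lround(-3.7) == -4);
}
```

So the substitution is only valid where truncation is acceptable.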


Re: Speed kills

2016-02-15 Thread Jack Stouffer via Digitalmars-d

On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:
we could get ceil/trunc/round/floor, also almost as easily 
fmod and hypot.

Classic, but I don't get why they're not in std.math.


Seems like you know a lot about the subject, and I know you 
contributed to phobos before, so how about making a PR for this :)


Re: Speed kills

2016-02-15 Thread Basile B. via Digitalmars-d
On Monday, 15 February 2016 at 14:16:02 UTC, Guillaume Piolat 
wrote:

On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
This is the most recent one where John Colvin provided some 
pointers to speed it up significantly. Walter has done some 
good work taking the low-hanging fruit to speed up DMD code 
and there is a lot of effort going on with reference counting 
machinery but I wondered if some of the common errors people 
make that slow down D code can be addressed?


Something that annoyed me a bit is floating-point comparisons: 
DMD does not seem to be able to handle them from SSE registers; 
it will convert to the FPU and do the comparison there, IIRC.


Same for std.math.lround

they use the FPU path, while for float and double it's only one 
SSE instruction. Typically with 6 functions similar to this one:



int round(float value)
{
    asm
    {
        naked;                // no prologue/epilogue
        cvtss2si EAX, XMM0;   // float in XMM0 -> int32 in EAX
        ret;
    }
}

we could get ceil/trunc/round/floor, also almost as easily fmod 
and hypot.

Classic, but I don't get why they're not in std.math.

Goddamnit, we're in 2016.


Re: Speed kills

2016-02-15 Thread rsw0x via Digitalmars-d

On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
Every time there is a D thread on reddit it feels like the new 
user is expecting mind-blowing speed from D.


[...]


if you want better codegen, don't use dmd.
use ldc; it's usually only a version or so behind dmd.


Re: Speed kills

2016-02-15 Thread Wyatt via Digitalmars-d
On Monday, 15 February 2016 at 14:16:02 UTC, Guillaume Piolat 
wrote:


Something that annoyed me a bit is floating-point comparisons: 
DMD does not seem to be able to handle them from SSE registers; 
it will convert to the FPU and do the comparison there, IIRC.


I feel like this point comes up often, and that a lot of people 
have argued x87 FP should just not happen anymore.


-Wyatt


Re: Speed kills

2016-02-15 Thread Guillaume Piolat via Digitalmars-d

On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
This is the most recent one where John Colvin provided some 
pointers to speed it up significantly. Walter has done some 
good work taking the low-hanging fruit to speed up DMD code and 
there is a lot of effort going on with reference counting 
machinery but I wondered if some of the common errors people 
make that slow down D code can be addressed?


Something that annoyed me a bit is floating-point comparisons: 
DMD does not seem to be able to handle them from SSE registers; 
it will convert to the FPU and do the comparison there, IIRC.


Speed kills

2016-02-15 Thread ixid via Digitalmars-d
Every time there is a D thread on reddit it feels like the new 
user is expecting mind-blowing speed from D.


https://www.reddit.com/r/programming/comments/45v03g/porterstemmerd_an_implementation_of_the_porter/

This is the most recent one where John Colvin provided some 
pointers to speed it up significantly. Walter has done some good 
work taking the low-hanging fruit to speed up DMD code and there 
is a lot of effort going on with reference counting machinery but 
I wondered if some of the common errors people make that slow 
down D code can be addressed?


Literals used to be a hidden speed bump, but I think that was 
improved; now the append operator is one of the most common 
culprits. Can this not be enhanced behind the scenes to work more 
like append? Do others notice common pitfalls between the article 
code and what the D community then suggests, where we can bridge 
the gap so naive users get faster code?
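On the append operator specifically, the usual community suggestion is std.array.Appender; a quick sketch of the two styles:

```d
import std.array : appender;

void main()
{
    // Plain ~= may reallocate repeatedly as the array grows:
    int[] a;
    foreach (i; 0 .. 1000)
        a ~= i;

    // Appender keeps its own capacity and amortizes the growth:
    auto b = appender!(int[])();
    b.reserve(1000);             // optional: preallocate up front
    foreach (i; 0 .. 1000)
        b.put(i);

    assert(a == b.data);         // same contents, fewer reallocations
}
```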