Re: Optimization fun

2014-11-08 Thread Dmitry Olshansky via Digitalmars-d

08-Nov-2014 03:22, Walter Bright wrote:

On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:

That's the problem with profilers:
 they say what takes time but not why :)

Often I find myself looking at memcpy at the top of the list, so
obviously the textbook answer is to optimize memcpy ;) In contrast, it
should be read as "you seem to do excessive copying of data".


dmd's profiler will give you a tree showing who called each function how
many times. This helps a lot with the why question.



This information is very limited, especially in the case of recursive 
functions; it's still more of a "what" than a "why". The "why" always 
takes a bit of research.


Also, I'd suggest using several different profilers.

--
Dmitry Olshansky


Re: Optimization fun

2014-11-08 Thread H. S. Teoh via Digitalmars-d
On Sat, Nov 08, 2014 at 02:29:25AM +0000, Kiith-Sa via Digitalmars-d wrote:
[...]
 Except for very specific cases, neither gprof nor DMD's profiler is
 good for profiling. If you're on Linux, you have perf, which works
 well with D and is way ahead of anything the DMD profiler will be able
 to do *after* man-years of further development.
 
 See (shameless advertisement):
 
 http://defenestrate.eu/_static/profiling-slides/index.html#22
 
 (especially perf record/perf report)

The slides mention not to use gprof, but never explain why.

Also, the recommendations are kinda not what I'm looking for; I'm not
looking so much for interactive performance (trying to optimize branch
predictions, etc.), because I'm anticipating that in the future, most of
the problems I'll be running will be I/O bound because the size of the
problems will *far* exceed available RAM. Mispredicted branches or L1/L2
cache misses are kinda irrelevant in this scenario. Or at least, they're
not as important as they're made out to be, since what will matter more
is how to minimize disk seeks and how to keep frequently-accessed cached
disk pages in the CPU cache. So it's still somewhat relevant, just not
as much as in other kinds of applications, like interactive games.


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder


Re: Optimization fun

2014-11-07 Thread thedeemon via Digitalmars-d
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via 
Digitalmars-d wrote:
1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never
release it (it keeps a hash of visited states that only grows, never
shrinks).


When you grow a hash table it periodically reallocates bucket 
arrays it uses internally, so some garbage to collect appears 
anyway even if you only add elements.




Re: Optimization fun

2014-11-07 Thread Steven Schveighoffer via Digitalmars-d

On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote:


A little more investigation revealed the culprit: a queue object
implemented with subarrays, and subarrays were being used as stacks by
appending with ~= and popping by decrementing .length. A dangerous
combination that causes excessive reallocations! After inserting
assumeSafeAppend() in the right place, the memory consumption rate
dropped to more reasonable levels. It's still visibly faster than when
GC is enabled, so probably there are other places with excessive
allocations, but this particular change stood out because after adding
assumeSafeAppend() to the code, the solver's performance *even with GC
enabled* improved by about 40%-50%. So this explains why calling
GC.disable() had such a pronounced effect before: the GC is too busy
cleaning up after unnecessarily allocation-hungry code!


In D1 this would have worked fine. A queue array wrapper may be a good 
candidate for Phobos: it would work just like a normal array but call 
assumeSafeAppend whenever it decrements length.
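
Something along these lines, perhaps (a minimal illustrative sketch, not
an actual Phobos proposal; the name SafeStack is made up):

    struct SafeStack(T) {
        private T[] data;

        void push(T t) {
            data ~= t;
        }

        T pop() {
            assert(data.length > 0, "pop from empty stack");
            T t = data[$ - 1];
            data.length--;
            // Re-arm in-place appending so the next ~= extends the
            // existing block instead of reallocating.
            data.assumeSafeAppend();
            return t;
        }

        @property bool empty() const { return data.length == 0; }
    }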


-Steve



Re: Optimization fun

2014-11-07 Thread Steven Schveighoffer via Digitalmars-d

On 11/7/14 8:43 AM, Steven Schveighoffer wrote:

On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote:


A little more investigation revealed the culprit: a queue object
implemented with subarrays, and subarrays were being used as stacks by
appending with ~= and popping by decrementing .length. A dangerous
combination that causes excessive reallocations! After inserting
assumeSafeAppend() in the right place, the memory consumption rate
dropped to more reasonable levels. It's still visibly faster than when
GC is enabled, so probably there are other places with excessive
allocations, but this particular change stood out because after adding
assumeSafeAppend() to the code, the solver's performance *even with GC
enabled* improved by about 40%-50%. So this explains why calling
GC.disable() had such a pronounced effect before: the GC is too busy
cleaning up after unnecessarily allocation-hungry code!


In D1 this would have worked fine. A queue array wrapper may be a good
candidate for Phobos: it would work just like a normal array but call
assumeSafeAppend whenever it decrements length.


Also to note -- even with this, you will create garbage. As the array 
grows, it may have to relocate to a new block. The old block would be 
garbage.


Once you get large enough, there is a chance that it will just extend 
into new pages. But if it ever has to reallocate a very large array, 
then you just created a very large piece of garbage that will not be 
cleaned up.


I'm wondering if said queue array should have an option to forcefully 
delete the old array. In which case, we'd have to make sure the data 
cannot be accessed directly, or mark it as unsafe.
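
A rough sketch of what that might look like (illustrative only, and safe
only if the wrapper is the sole owner of the array, per the caveat above):

    import core.memory : GC;

    struct OwnedStack(T) {
        private T[] data;

        void push(T t) {
            auto oldPtr = data.ptr;
            data ~= t;
            // If ~= relocated the array, the old block is garbage that
            // only this wrapper could have referenced; free it eagerly.
            if (oldPtr !is null && data.ptr !is oldPtr)
                GC.free(oldPtr);
        }
    }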


-Steve



Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 09:28:56AM +0000, thedeemon via Digitalmars-d wrote:
 On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d
 wrote:
 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never
 release it (it keeps a hash of visited states that only grows, never
 shrinks).
 
 When you grow a hash table it periodically reallocates bucket arrays
 it uses internally, so some garbage to collect appears anyway even if
 you only add elements.

I realize that; it's just that even after accounting for it, the
memory consumption rate was much higher than expected, and sure enough,
further investigation revealed unnecessary GC load caused by
reallocating arrays where (re)using an existing one would suffice.

In the case of the function that used to return a new array on every
call (and there are hundreds of thousands of calls for this particular
test case), after implementing buffer reuse the buffer still needs to
be reallocated occasionally when it runs out of space, but that happens
only 3-4 times for the entire run.
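
The pattern, roughly (illustrative names and work, not the actual
solver code):

    // The caller owns a buffer that is reused across calls; it only
    // grows when the needed size exceeds its current capacity.
    int[] computeInto(ref int[] buf, size_t n) {
        if (buf.length < n)
            buf.length = n;             // rare: only 3-4 times per run
        foreach (i; 0 .. n)
            buf[i] = cast(int)(i * i);  // stand-in for the real work
        return buf[0 .. n];             // slice of the reused buffer
    }

    // usage: one buffer, hundreds of thousands of calls
    //     int[] buf;
    //     foreach (iter; 0 .. 500_000)
    //         auto r = computeInto(buf, 50);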

I'm not sure what the overhead for the hash table would be, but I'm
expecting it shouldn't be more than, say, 50-100 reallocations of the
bucket array (I didn't measure this). So it shouldn't add up to *that*
much.


T

-- 
There is no gravity. The earth sucks.


Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 08:46:29AM -0500, Steven Schveighoffer via 
Digitalmars-d wrote:
 On 11/7/14 8:43 AM, Steven Schveighoffer wrote:
 On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 
 A little more investigation revealed the culprit: a queue object
 implemented with subarrays, and subarrays were being used as stacks
 by appending with ~= and popping by decrementing .length. A
 dangerous combination that causes excessive reallocations! After
 inserting assumeSafeAppend() in the right place, the memory
 consumption rate dropped to more reasonable levels. It's still
 visibly faster than when GC is enabled, so probably there are other
 places with excessive allocations, but this particular change stood
 out because after adding assumeSafeAppend() to the code, the
 solver's performance *even with GC enabled* improved by about
 40%-50%. So this explains why calling GC.disable() had such a
 pronounced effect before: the GC is too busy cleaning up after
 unnecessarily allocation-hungry code!
 
 In D1 this would have worked fine. A queue array wrapper may be a
 good candidate for Phobos: it would work just like a normal array but
 call assumeSafeAppend whenever it decrements length.
 
 Also to note -- even with this, you will create garbage. As the array
 grows, it may have to relocate to a new block. The old block would be
 garbage.
 
 Once you get large enough, there is a chance that it will just extend
 into new pages. But if it ever has to reallocate a very large array,
 then you just created a very large piece of garbage that will not be
 cleaned up.

Good point. Maybe that's where the remaining garbage is coming from:
remnants of reallocated arrays. Some of the subarrays I'm using do
grow pretty big, so things could add up.


 I'm wondering if said queue array should have an option to
 forcefully delete the old array. In which case, we'd have to make sure
 the data cannot be accessed directly, or mark it as unsafe.
[...]

Well, if we're going to manually manage the array's memory, then we're
back in malloc/free territory along with all of their pitfalls. :-) So I
wouldn't sweat it too much -- I'd make the queue a non-copyable
reference type (so there would be no aliasing of the queue array) and
leave it at that.

On another note, I wonder if reducing the frequency of GC collections
might help keep memory under control while minimizing the performance
hits. As a first stab, I could try to manually trigger a collection
every, say, 100,000 iterations of the main loop, which would be about 10
or so collections per run, which should keep memory usage under control
while avoiding excessively frequent collections that degrade
performance.
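
In code, the idea would be something like this (sketch; the iteration
count is the guess above):

    import core.memory : GC;

    void runSolver(size_t totalIters) {
        GC.disable();              // suppress automatic collections
        scope(exit) GC.enable();
        foreach (iter; 0 .. totalIters) {
            // ... one iteration of the main solver loop ...
            if ((iter + 1) % 100_000 == 0)
                GC.collect();      // ~10 manual collections per run
        }
    }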


T

-- 
I speak better English than this villain Bush -- Mohammed Saeed al-Sahaf, 
Iraqi Minister of Information


Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 09:30:47AM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 On another note, I wonder if reducing the frequency of GC collections
 might help keep memory under control while minimizing the performance
 hits. As a first stab, I could try to manually trigger a collection
 every, say, 100,000 iterations of the main loop, which would be about
 10 or so collections per run, which should keep memory usage under
 control while avoiding excessively frequent collections that degrade
 performance.
[...]

OK, I did a little test where I would disable the GC at the beginning of
the program, then trigger a collection every 250,000 iterations. As a
rough visual aid, I print a message before and after the collection so
that I can gauge how long each collection is taking. It appears that as
the program progresses, collections become longer -- expected, because
the hash table is growing rapidly, so the marking phase probably would
take longer (I didn't measure this in detail though).

Overall, a puzzle that took ~7.0 seconds to solve now takes ~7.6
seconds, which is about 8% performance degradation. That's pretty high,
considering that we already know that most of the new allocations
between collections are live. Is there a way to tell the GC to ignore
the hash table buckets for collection purposes? It would seem that most
of the work is being wasted there since we already know they will be
live until the end of the program.
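
(As far as I know there is no "assume live" attribute, only two partial
measures in core.memory -- sketched here; whether either applies depends
on whether the blocks can contain pointers:

    import core.memory : GC;

    void example() {
        // 1) Blocks that hold no pointers can be allocated unscanned,
        //    so the mark phase skips their contents entirely:
        auto raw = GC.malloc(4096, GC.BlkAttr.NO_SCAN);

        // 2) A block can be pinned as a permanent root so it is never
        //    collected -- though it will still be scanned:
        GC.addRoot(raw);
        // ... GC.removeRoot(raw); when no longer needed
    }

Neither helps directly with the built-in hash table, though, since its
buckets are allocated internally by the runtime.)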

Another alternative is to manually call GC.free on reallocated arrays in
the two places I've identified, and turn off collections completely.
This could be a better approach in this case, since we already know
where the reallocations happen.

Another GC-related question: if the heap grows too large to fit in RAM,
and a GC collection cycle is triggered, will this cause thrashing? --
since the GC will have to scan through all the blocks, causing *all* of
them to be paged in and paged out again.

Of course, this little side-exercise is more of a curiosity than
anything else, since I'm planning to replace the hash table with
something else that manually manages memory (and page to disk as
necessary) anyway, so any optimizations along these lines will be moot.
But it's interesting to explore how far one can go with the current
system.


T

-- 
It is not the employer who pays the wages. Employers only handle the money. It 
is the customer who pays the wages. -- Henry Ford


Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 10:01:41AM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 OK, I did a little test where I would disable the GC at the beginning of
 the program, then trigger a collection every 250,000 iterations. As a
 rough visual aid, I print a message before and after the collection so
 that I can gauge how long each collection is taking. It appears that as
 the program progresses, collections become longer -- expected, because
 the hash table is growing rapidly, so the marking phase probably would
 take longer (I didn't measure this in detail though).
 
 Overall, a puzzle that took ~7.0 seconds to solve now takes ~7.6
 seconds, which is about 8% performance degradation. That's pretty high,
 considering that we already know that most of the new allocations
 between collections are live.
[...]

P.S. Note, however, that with the GC fully enabled, the average running
time is about 9.8 seconds (the 7.0s figure is when the GC is completely
disabled and no collections are run, so memory is leaking). That's still
a ~25% improvement over the normal GC behaviour, while preventing memory
leaks, even if 8% is a little high for a scenario where we already know
most of the memory is still live. (One could see an argument for a
generational collector here!)

So arguably, the frequency of the GC's collections could be tweaked, as
Steven mentioned. We could be getting ~25% performance improvement for
free! (But this is of course highly dependent on system-specific and
program-specific measurements, so I don't know if this will actually
improve performance in general. Still, it's an interesting area to
explore in terms of improving the performance of the current GC. Where's
Andrei with his nice shiny yet-to-be-materialized new GC when we need
him??! :-P)


T

-- 
640K ought to be enough -- Bill G. (allegedly), 1984. The Internet is not a 
primary goal for PC usage -- Bill G., 1995. Linux has no impact on 
Microsoft's strategy -- Bill G., 1999.


Re: Optimization fun

2014-11-07 Thread Ary Borenszweig via Digitalmars-d

On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote:

So today, I was playing around with profiling and optimizing my sliding
block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never release
it (it keeps a hash of visited states that only grows, never shrinks).
The profiler indicated that one of the hotspots is the GC. So I added an
option to the program to call GC.disable() if a command line option is
given. The result is a 40%-50% reduction in running time.

2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil() extensively
to locate things on the board. The original code had:

auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually
no reason to do so: ch is a char, and therefore countUntil could've used
strchr instead). Simply changing that to:

auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.


3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked at
gdc's output instead. Digging into the disassembly, I saw that it was
using a foreach loop over the char[], but for some reason even gdc -O3
failed to simplify that loop significantly (I'm not sure why -- maybe
what threw off the optimizer is the array indexing + length check
operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

 auto ptr = current.ptr;
 auto c = current.length;
 while (c > 0 && *ptr != m.id)
     ptr++;
 cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly had
no check for (c > 0)... could it be because there is an assert following
the loop claiming that that will never happen, therefore gdc elided the
check? I thought Walter hasn't gotten around to implementing
optimizations based on assert yet??  Anyway, with this optimization, I
managed to shave off another 3%-5% of the total running time.

On that note, at one point I was wondering why gdc -O3 didn't generate a
rep scasb for the inner loop instead of manually incrementing the loop
pointer; but a little research revealed that I'm about 6-7 years out of
date: since about 2008 gcc's optimizer has not used rep scasb because
on modern hardware it has atrocious performance -- much worse than a
manually-written C loop! So much for built-in string processing that
used to be touted in the old days. Yet another proof of the rule that
overly-complex CPU instruction sets rarely actually perform better.


Anyway, after everything is said and done, a puzzle that used to take
about 20 seconds to solve now only takes about 6 seconds. W00t!


T



Hi,

Is the code public? I'd like to port it to other languages and see how 
they behave, see if this is a general problem or just something specific 
to D.


Thanks!


Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d 
wrote:
[...]
 Is the code public? I'd like to port it to other languages and see how
 they behave, see if this is a general problem or just something
 specific to D.
[...]

I haven't posted the code anywhere yet, but I could certainly send you a
copy if you want.

However, the test input I'm using is not public, so you'll have to craft
your own input files for testing purposes. The public input files I do
have are too simple to pose a significant challenge to the solver (plus,
they default to BFS instead of A*, so you might get different results
from what I described here).


T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day 
and hour soon are coming / When all the IT folks say Gosh! / It isn't from a 
clever lawsuit / That Windowsland will finally fall, / But thousands writing 
open source code / Like mice who nibble through a wall. -- The Linux-nationale 
by Greg Baker


Re: Optimization fun

2014-11-07 Thread Ary Borenszweig via Digitalmars-d

On 11/7/14, 4:16 PM, H. S. Teoh via Digitalmars-d wrote:

On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d 
wrote:
[...]

Is the code public? I'd like to port it to other languages and see how
they behave, see if this is a general problem or just something
specific to D.

[...]

I haven't posted the code anywhere yet, but I could certainly send you a
copy if you want.

However, the test input I'm using is not public, so you'll have to craft
your own input files for testing purposes. The public input files I do
have are too simple to pose a significant challenge to the solver (plus,
they default to BFS instead of A*, so you might get different results
from what I described here).


T



Then I don't think I'll be able to do any useful benchmark/profile with 
that. Thanks anyway :-)


Re: Optimization fun

2014-11-07 Thread Dmitry Olshansky via Digitalmars-d

07-Nov-2014 09:05, H. S. Teoh via Digitalmars-d wrote:

On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never
release it (it keeps a hash of visited states that only grows, never
shrinks).  The profiler indicated that one of the hotspots is the GC.
So I added an option to the program to call GC.disable() if a command
line option is given. The result is a 40%-50% reduction in running
time.

[...]

Yet more GC drama unfolded tonight: while testing my new GC-disabling
option on several other puzzles, one instance ate up my PC's RAM at an
astonishing rate, causing the system to almost freeze up from the
ensuing thrashing. A little investigation revealed that there are
several instances of excessive GC load caused by unnecessary
(re)allocations.



That's the problem with profilers:
they say what takes time but not why :)

Often I find myself looking at memcpy at the top of the list, so 
obviously the textbook answer is to optimize memcpy ;) In contrast, it 
should be read as "you seem to do excessive copying of data".



The moral of the story is that code like the following should raise big
red warning flags:

struct MyData(T) {
    T[] data;
    T pop() {
        ...
        data.length--; // <-- yikes
        ...
    }
    void push(T t) {
        ...
        data ~= t; // <-- yikes
        ...
    }
}


A performance bug well known in the inner circles. Sadly, most D 
programmers are bound to hit it once in a while.

Memory safety by default has its costs.
Though I'd just call mmap: far more predictable than these pesky memory 
allocators, plus no need to reallocate if the reserved range is large 
enough.
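
For instance (a POSIX-only sketch; the sizes and names are illustrative):

    import core.sys.posix.sys.mman;

    // Reserve a large virtual range up front; pages are only backed
    // by physical RAM once touched, and a grow-only structure inside
    // the range never needs to reallocate or copy.
    void* reserveArena(size_t maxBytes) {
        void* p = mmap(null, maxBytes,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        return p is MAP_FAILED ? null : p;
    }

    // e.g. auto arena = reserveArena(1UL << 30); // 1 GiB of address space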



--
Dmitry Olshansky


Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Sat, Nov 08, 2014 at 01:58:31AM +0300, Dmitry Olshansky via Digitalmars-d 
wrote:
 07-Nov-2014 09:05, H. S. Teoh via Digitalmars-d wrote:
[...]
 Yet more GC drama unfolded tonight: while testing my new GC-disabling
 option on several other puzzles, one instance ate up my PC's RAM at
 an astonishing rate, causing the system to almost freeze up from the
 ensuing thrashing. A little investigation revealed that there are
 several instances of excessive GC load caused by unnecessary
 (re)allocations.
 
 That's the problem with profilers:
   they say what takes time but not why :)

True! But it's better than the stereotypical C veteran's approach of "I
know from 50 years' experience (aka personal bias) that X is faster than
Y because it's more terse, so I'm gonna optimize all my code into
unreadable code from the get-go even if 99% of it isn't the real hotspot
anyway".


 Often I find myself looking at memcpy at the top of the list, so
 obviously the textbook answer is to optimize memcpy ;) In contrast, it
 should be read as "you seem to do excessive copying of data".
[...]

Well, when I saw countUntil at the top of the list, it wasn't
immediately obvious *why* it was at the top of the list. In the grand
scheme of things, it's just a small cog in a large machinery filled
with many other things that could easily be considered far better
candidates for performance bottlenecks. And while optimizing countUntil
itself did yield significant improvement, a deeper investigation led to
the realization that the underlying logic of the code was causing it to
call countUntil more often than it needs to. Fixing this issue
drastically reduced the number of calls to countUntil, which yielded no
small overall improvement.

In any case, the old optimization adage applies: you should optimize the
underlying algorithms (i.e., the big-O's) before you spend time
optimizing the low-level primitives (the scalar factors).  No matter how
much you optimize the heck out of the scalar factor, an O(N^2) algorithm
is still an O(N^2) algorithm.  As long as the algorithm remains O(N^2),
you aren't going to get much further.


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't deserve what 
you have now either.


Re: Optimization fun

2014-11-07 Thread Walter Bright via Digitalmars-d

On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:

That's the problem with profilers:
 they say what takes time but not why :)

Often I find myself looking at memcpy at the top of the list, so obviously the
textbook answer is to optimize memcpy ;) In contrast, it should be read as
"you seem to do excessive copying of data".


dmd's profiler will give you a tree showing who called each function how many 
times. This helps a lot with the why question.




Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 04:22:49PM -0800, Walter Bright via Digitalmars-d wrote:
 On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:
 That's the problem with profilers:
  they say what takes time but not why :)
 
 Often I find myself looking at memcpy at the top of the list, so
 obviously the textbook answer is to optimize memcpy ;) In contrast, it
 should be read as "you seem to do excessive copying of data".
 
 dmd's profiler will give you a tree showing who called each function
 how many times. This helps a lot with the why question.

gprof also shows that in the section section of its output.

But speaking of which, I found dmd -profile's output in trace.log a
little difficult to understand because of the lack of self-documenting
headers in the first section. gprof, for example, produces
nicely-labelled output that describes what the data means. Would you
accept a PR along these lines? (Or is there a post-processing step that
I'm missing?)

Also, in the second section, I'm getting some negative numbers for the
call times, which seem to indicate integer overflow? This basically
destroys the usefulness of dmd -profile for my test cases, since most of
the interesting cases for me are those with long running times, which
for sure will run into integer overflow at the sample rate that dmd
-profile is using, causing the second section to be essentially randomly
sorted.


T

-- 
He who does not appreciate the beauty of language is not worthy to bemoan its 
flaws.


Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 04:41:27PM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 gprof also shows that in the section section of its output.
[...]
   ^^^
Argh, I mean *second* section (which, incidentally, corresponds with the
first section of trace.log in the output generated by dmd -profile).
Dyslexic today. :-(


T

-- 
Study gravitation, it's a field with a lot of potential.


Re: Optimization fun

2014-11-07 Thread Walter Bright via Digitalmars-d

On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote:

On Fri, Nov 07, 2014 at 04:22:49PM -0800, Walter Bright via Digitalmars-d wrote:

On 11/7/2014 2:58 PM, Dmitry Olshansky wrote:

That's the problem with profilers:
 they say what takes time but not why :)

Often I find myself looking at memcpy at the top of the list, so
obviously the textbook answer is to optimize memcpy ;) In contrast, it
should be read as "you seem to do excessive copying of data".


dmd's profiler will give you a tree showing who called each function
how many times. This helps a lot with the why question.


gprof also shows that in the section section of its output.

But speaking of which, I found dmd -profile's output in trace.log a
little difficult to understand because of the lack of self-documenting
headers in the first section. gprof, for example, produces
nicely-labelled output that describes what the data means. Would you
accept a PR along these lines?


I don't really know what you have in mind, so I'll have to say it depends. 
Keep in mind that subsequent runs of the profiler parse the previous output file 
and add it to the results.




Also, in the second section, I'm getting some negative numbers for the
call times, which seem to indicate integer overflow? This basically
destroys the usefulness of dmd -profile for my test cases, since most of
the interesting cases for me are those with long running times, which
for sure will run into integer overflow at the sample rate that dmd
-profile is using, causing the second section to be essentially randomly
sorted.


Yeah, a negative number is likely an overflow. Reduce the test case!



Re: Optimization fun

2014-11-07 Thread H. S. Teoh via Digitalmars-d
On Fri, Nov 07, 2014 at 05:31:44PM -0800, Walter Bright via Digitalmars-d wrote:
 On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
 But speaking of which, I found dmd -profile's output in trace.log a
 little difficult to understand because of the lack of
 self-documenting headers in the first section. gprof, for example,
 produces nicely-labelled output that describes what the data means.
 Would you accept a PR along these lines?
 
 I don't really know what you have in mind, so I'll have to say it
 depends.  Keep in mind that subsequent runs of the profiler parse the
 previous output file and add it to the results.

Just add a simple header that describes what each column means.


 Also, in the second section, I'm getting some negative numbers for
 the call times, which seem to indicate integer overflow? This
 basically destroys the usefulness of dmd -profile for my test cases,
 since most of the interesting cases for me are those with long
 running times, which for sure will run into integer overflow at the
 sample rate that dmd -profile is using, causing the second section to
 be essentially randomly sorted.
 
 Yeah, a negative number is likely an overflow. Reduce the test case!

Unfortunately, I can't. The behaviour I'm trying to profile is the
long-term running case, because there's a whole bunch of setup at the
beginning that initially takes a while, but eventually it's the
long-running part that dominates the runtime behaviour. Reducing the
test case will only increase the initial noise, which I'm not interested
in, and reduce the running time of the main loop of interest. Besides,
what I'm trying to optimize for is the large, complex case; the simple
cases already run fast enough that the performance characteristics are
not that interesting to me. It's the long-term part that's interesting
because that's when the GC starts kicking in and pressure on system RAM
starts to increase.

I'm surprised that dmd's profiler can't even handle something that only
runs for 7-8 seconds or so! gprof certainly has no such limitation.

Is it relatively simple to make dmd -profile use larger integer widths
for profiling? If not, I'm afraid I'm just gonna have to stick with
gdc/gprof instead.


T

-- 
Your inconsistency is the only consistent thing about you! -- KD


Re: Optimization fun

2014-11-07 Thread Walter Bright via Digitalmars-d

On 11/7/2014 5:51 PM, H. S. Teoh via Digitalmars-d wrote:

I'm surprised that dmd's profiler can't even handle something that only
runs for 7-8 seconds or so!


It's based on a design I wrote decades ago, when machines were a lot slower.


Is it relatively simple to make dmd -profile use larger integer widths
for profiling?


I don't know. Haven't looked at it for a while.



Re: Optimization fun

2014-11-07 Thread Kiith-Sa via Digitalmars-d
On Saturday, 8 November 2014 at 01:53:33 UTC, H. S. Teoh via
Digitalmars-d wrote:

On Fri, Nov 07, 2014 at 05:31:44PM -0800, Walter Bright via
Digitalmars-d wrote:

On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
But speaking of which, I found dmd -profile's output in trace.log a
little difficult to understand because of the lack of self-documenting
headers in the first section. gprof, for example, produces
nicely-labelled output that describes what the data means. Would you
accept a PR along these lines?

I don't really know what you have in mind, so I'll have to say it
depends.  Keep in mind that subsequent runs of the profiler parse the
previous output file and add it to the results.

Just add a simple header that describes what each column means.

Also, in the second section, I'm getting some negative numbers for
the call times, which seem to indicate integer overflow? This
basically destroys the usefulness of dmd -profile for my test cases,
since most of the interesting cases for me are those with long
running times, which for sure will run into integer overflow at the
sample rate that dmd -profile is using, causing the second section to
be essentially randomly sorted.

Yeah, a negative number is likely an overflow. Reduce the test case!

Unfortunately, I can't. The behaviour I'm trying to profile is the
long-term running case, because there's a whole bunch of setup at the
beginning that initially takes a while, but eventually it's the
long-running part that dominates the runtime behaviour. Reducing the
test case will only increase the initial noise, which I'm not
interested in, and reduce the running time of the main loop of
interest. Besides, what I'm trying to optimize for is the large,
complex case; the simple cases already run fast enough that the
performance characteristics are not that interesting to me. It's the
long-term part that's interesting because that's when the GC starts
kicking in and pressure on system RAM starts to increase.

I'm surprised that dmd's profiler can't even handle something that
only runs for 7-8 seconds or so! gprof certainly has no such
limitation.

Is it relatively simple to make dmd -profile use larger integer
widths for profiling? If not, I'm afraid I'm just gonna have to stick
with gdc/gprof instead.

T


Except for very specific cases, neither gprof nor DMD's profiler 
is good for profiling. If you're on Linux, you have perf, which 
works well with D and is way ahead of anything the DMD profiler 
will be able to do *after* man-years of further development.


See (shameless advertisement):

http://defenestrate.eu/_static/profiling-slides/index.html#22

(especially perf record/perf report)


Optimization fun

2014-11-06 Thread H. S. Teoh via Digitalmars-d
So today, I was playing around with profiling and optimizing my sliding
block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never release
it (it keeps a hash of visited states that only grows, never shrinks).
The profiler indicated that one of the hotspots is the GC. So I added an
option to the program to call GC.disable() if a command line option is
given. The result is a 40%-50% reduction in running time.

2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil() extensively
to locate things on the board. The original code had:

auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually
no reason to do so: ch is a char, and therefore countUntil could've used
strchr instead). Simply changing that to:

auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.


3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked at
gdc's output instead. Digging into the disassembly, I saw that it was
using a foreach loop over the char[], but for some reason even gdc -O3
failed to simplify that loop significantly (I'm not sure why -- maybe
what threw off the optimizer is the array indexing + length check
operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

        auto ptr = current.ptr;
        auto c = current.length;
        while (c > 0 && *ptr != m.id)
            ptr++;
        cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly had
no check for (c > 0)... could it be because there is an assert following
the loop claiming that that will never happen, therefore gdc elided the
check? I thought Walter hasn't gotten around to implementing
optimizations based on assert yet??  Anyway, with this optimization, I
managed to shave off another 3%-5% of the total running time.

On that note, at one point I was wondering why gdc -O3 didn't generate a
rep scasb for the inner loop instead of manually incrementing the loop
pointer; but a little research revealed that I'm about 6-7 years out of
date: since about 2008 gcc's optimizer has not used rep scasb because
on modern hardware it has atrocious performance -- much worse than a
manually-written C loop! So much for built-in string processing that
used to be touted in the old days. Yet another proof of the rule that
overly-complex CPU instruction sets rarely actually perform better.


Anyway, after everything is said and done, a puzzle that used to take
about 20 seconds to solve now only takes about 6 seconds. W00t!


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert


Re: Optimization fun

2014-11-06 Thread H. S. Teoh via Digitalmars-d
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 auto ptr = current.ptr;
 auto c = current.length;
 while (c > 0 && *ptr != m.id)
     ptr++;
 cert.headOffset = ptr - current.ptr;
 
 This pushed gdc far enough in the right direction that it finally
 produced a 3-instruction inner loop. Strangely enough, the assembly had
 no check for (c > 0)... could it be because there is an assert following
 the loop claiming that that will never happen, therefore gdc elided the
 check? I thought Walter hasn't gotten around to implementing
 optimizations based on assert yet??
[...]

Argh... the reason (c > 0) had no check was because the code is wrong: c is
never decremented. :-(  Replacing (c > 0) with (c-- > 0) reduced the
performance improvement to about 3% instead of 4%-5%. Oh well. :-/

mouth.open(); mouth.insert(foot);


T

-- 
I am a consultant. My job is to make your job redundant. -- Mr Tom


Re: Optimization fun

2014-11-06 Thread Walter Bright via Digitalmars-d

On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:

2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil() extensively
to locate things on the board. The original code had:

auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually
no reason to do so: ch is a char, and therefore countUntil could've used
strchr instead). Simply changing that to:

auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.


I've argued this point for over a year, apparently without convincing anyone :-(

BTW, use .byChar instead of .representation, because the former leaves it typed 
as 'char' and the latter converts it to 'ubyte'.




3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked at
gdc's output instead. Digging into the disassembly, I saw that it was
using a foreach loop over the char[], but for some reason even gdc -O3
failed to simplify that loop significantly (I'm not sure why -- maybe
what threw off the optimizer is the array indexing + length check
operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

     auto ptr = current.ptr;
     auto c = current.length;
     while (c > 0 && *ptr != m.id)
         ptr++;
     cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly had
no check for (c > 0)... could it be because there is an assert
the loop claiming that that will never happen, therefore gdc elided the
check? I thought Walter hasn't gotten around to implementing
optimizations based on assert yet??  Anyway, with this optimization, I
managed to shave off another 3%-5% of the total running time.


Walter hasn't gotten around to implementing optimizations in GDC, that's fer 
sure! :-)




Re: Optimization fun

2014-11-06 Thread Walter Bright via Digitalmars-d

On 11/6/2014 3:13 PM, H. S. Teoh via Digitalmars-d wrote:

mouth.open(); mouth.insert(foot);


Now you've done it. You must pull the Stone of Shame!

https://pbs.twimg.com/media/BrENFq3CMAEqFUL.jpg



Re: Optimization fun

2014-11-06 Thread Sean Kelly via Digitalmars-d

On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via
Digitalmars-d wrote:
So today, I was playing around with profiling and optimizing my
sliding block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never
release it (it keeps a hash of visited states that only grows, never
shrinks). The profiler indicated that one of the hotspots is the GC.
So I added an option to the program to call GC.disable() if a command
line option is given. The result is a 40%-50% reduction in running
time.


We've talked before about keeping statistics of how much memory
has been reclaimed by collections, and using that to influence
whether a collection should be run before going to the OS for
more.  Something like this should definitely get a bugzilla
ticket.

For now, you might want to consider guessing at how much memory
you'll need and calling GC.reserve() as well.  This should
improve the performance of your app even further.
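
For example (sketch; the size is just a guess you'd tune per workload):

    import core.memory : GC;

    void main() {
        GC.reserve(512 * 1024 * 1024); // pre-grow the GC pool to ~512 MiB
        // ... run the solver; collections triggered merely by pool
        // exhaustion are now largely avoided ...
    }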



2) Auto-decoding is evil


Yep, it's the worst.  Most string operations don't even need
auto-decoding to work properly.  I honestly have no idea why we
have this feature.


Re: Optimization fun

2014-11-06 Thread H. S. Teoh via Digitalmars-d
On Thu, Nov 06, 2014 at 03:15:04PM -0800, Walter Bright via Digitalmars-d wrote:
 On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
 2) Auto-decoding is evil: the solver currently uses a naïve char[]
 representation for the puzzle board, and uses countUntil()
 extensively to locate things on the board. The original code had:
 
  auto offset = currentBoard.countUntil(ch);
 
 which, of course, triggers autodecoding (even though there is
 actually no reason to do so: ch is a char, and therefore countUntil
 could've used strchr instead). Simply changing that to:
 
  auto offset = currentBoard.representation.countUntil(ch);
 
 gave an additional 20%-30% performance boost.
 
 I've argued this point for over a year, apparently without convincing
 anyone :-(

I dunno, my impression is that some people agree with it, but consider
the cost of changing it now to be too prohibitive due to the extensive
code breakage it would cause. But the primary obstacle is that Andrei
is not convinced, so he's going to veto any change in this direction
regardless. You two will have to tussle it out to resolve this. ;-)


 BTW, use .byChar. instead of .representation. because the former
 leaves it typed as 'char' and the latter converts it to 'ubyte'.

I was going to use .byChar, but since I'm doing performance testing on
gdc, and (my version of) gdc hasn't caught up to the latest Phobos yet,
I have to settle for .representation for now. For my purposes it's
harmless, since I'm not actually doing anything with the ubyte
representation once countUntil is done; the return value is used to
index the original char[].


 3) However, countUntil remains at the top of the profiler's list of
 hotspots. DMD's optimizer was rather disappointing here, so I looked
 at gdc's output instead. Digging into the disassembly, I saw that it
 was using a foreach loop over the char[], but for some reason even
 gdc -O3 failed to simplify that loop significantly (I'm not sure why
 -- maybe what threw off the optimizer is the array indexing + length
 check operation, where in C one would normally just bump the pointer
 instead?). Eventually, I tried writing a manual loop:
 
 auto ptr = current.ptr;
 auto c = current.length;
 while (c > 0 && *ptr != m.id)
     ptr++;
 cert.headOffset = ptr - current.ptr;
 
 This pushed gdc far enough in the right direction that it finally
 produced a 3-instruction inner loop. Strangely enough, the assembly
 had no check for (c > 0)... could it be because there is an assert
 following the loop claiming that that will never happen, therefore
 gdc elided the check? I thought Walter hasn't gotten around to
 implementing optimizations based on assert yet??  Anyway, with this
 optimization, I managed to shave off another 3%-5% of the total
 running time.
 
 Walter hasn't gotten around to implementing optimizations in GDC,
 that's fer sure! :-)

LOL... I wasn't sure whether said optimizations were going to be in the
front end, or in the backend only. It might not be a bad idea to put
them in the front end where applicable, so that all compilers benefit
from them.


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- 
Christopher


Re: Optimization fun

2014-11-06 Thread Meta via Digitalmars-d

On Thursday, 6 November 2014 at 23:30:56 UTC, Walter Bright wrote:

On 11/6/2014 3:13 PM, H. S. Teoh via Digitalmars-d wrote:

mouth.open(); mouth.insert(foot);


Now you've done it. You must pull the Stone of Shame!

https://pbs.twimg.com/media/BrENFq3CMAEqFUL.jpg


The Shame Cube is far more efficient at inflicting shame.

http://stream1.gifsoup.com/view3/2323130/shame-cube-o.gif


Re: Optimization fun

2014-11-06 Thread H. S. Teoh via Digitalmars-d
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote:
 So today, I was playing around with profiling and optimizing my sliding
 block puzzle solver, and found some interesting things:
[...]

Oh, I forgot to mention something else that might be of interest to
people: in the course of optimizing the performance of canFind(), one of
the techniques I used was to reduce the frequency with which it's
called. One such situation was the duplication between the canMove()
method, which checks whether a prospective move is valid, and move(),
which performs the actual move (update the puzzle board).

In more general terms, the same pattern applies to data structures that
have a check method for evaluating whether a particular operation is
legal or not, and another method for actually performing the operation.
For example, when updating a hash table or binary tree you might do a
find to see if an entry already exists, before you insert a new entry.
While check() followed by insert() makes sense logically (first you
check, then you proceed if the check is OK), it performs poorly in code,
because quite often there is a fair amount of overlap in the work done
by either method. The check, for example, may have found the node in
which the insert will take place, but since it has to return to the
caller first, this information is lost and insert() has to recompute
something that it should already have known.

There are a few common solutions to this problem:

1) Have a checkAndInsert() method that both checks and performs the
insertion if the check passes. While this solves the problem, it's ugly
because it conflates two logically distinct operations into one, thus
breaking the principle of single responsibility.

A variation on (1) is to pass a flag (ugh) to insert() to do a check
before inserting. It's essentially the same solution with the same
drawbacks.

2) check() returns a node (or iterator, or range, whatever) instead of a
bool, which is then passed to insert() so it doesn't have to traverse
the structure again. This is a slightly better solution, but the
disadvantage is that you expose the implementation details of the
structure, and besides, why should a method named check return an
iterator? It's also unable to capture other repeated computations
effectively; e.g., check() may have computed things like the balance
factor of the tree at that point, which insert() can profitably reuse,
but an iterator/range doesn't (and shouldn't!) capture such a kind of
information.

3) check() takes an additional ref parameter that can be used to store
extra information to be passed to insert(), then the caller passes this
extra information when it calls insert(). This is also ugly, because it
pollutes user code with implementation details. This can be restricted
to only private use by other methods, so that we don't break
encapsulation, but then user code won't be able to benefit from it.

In the end, the idea occurred to me that the reason insert() can reuse
certain things check() has already computed, is because those things
were computed as part of the process of *verifying* the validity of a
certain operation. Much like verifying the validity of a piece of data
with a checksum, the checksum computed by the verification function is a
proof, or certificate, that the data is valid. So it occurred to me that
check() should return not a bool, but an opaque *certificate* object
that captures the information computed by check() in the process of
validating the user-proposed operation. This certificate object can then
be inspected by the caller to see if the check passed (a default,
invalid certificate is returned if check() fails), and it can also
capture additional information that insert() can use afterwards. Then
insert() itself will have two overloads: one that takes the normal
parameters, which will (re)compute whatever information is necessary to
perform the operation, and another that takes a certificate that
certifies that the given operation has been previously validated, and
therefore insert() can skip recomputation of things already computed by
check() -- which are captured in the certificate object. Furthermore,
being a certificate, it also contains information about the original
request (otherwise you could illegally pair an invalid request with the
certificate of a different, valid request and cause insert() to
malfunction).

So my eventual solution was to change the canMove() method's signature
from:

bool canMove(Move m);

to:

MoveCert canMove(Move m);

And then the overloads of move() are:

	void move(Move m) {
	    // Unvalidated overload: will compute the values needed
	    // by the second overload of move manually since it's
	    // not already computed.
	    MoveCert cert = ... ;

	    // The actual implementation is the second overload; now
	    // that we have computed the necessary info to proceed,
	    // we just delegate to it:
	    move(cert);
	}

	void move(MoveCert cert) {
	    // Validated overload: reuses whatever canMove() (or the
	    // overload above) has already computed.
	    ...
	}

Re: Optimization fun

2014-11-06 Thread H. S. Teoh via Digitalmars-d
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never
 release it (it keeps a hash of visited states that only grows, never
 shrinks).  The profiler indicated that one of the hotspots is the GC.
 So I added an option to the program to call GC.disable() if a command
 line option is given. The result is a 40%-50% reduction in running
 time.
[...]

Yet more GC drama unfolded tonight: while testing my new GC-disabling
option on several other puzzles, one instance ate up my PC's RAM at an
astonishing rate, causing the system to almost freeze up from the
ensuing thrashing. A little investigation revealed that there are
several instances of excessive GC load caused by unnecessary
(re)allocations.

One case was a function that constructs and returns a new array each
call -- unnecessary because after the resulting array is traversed, it
is discarded. Perfect candidate for a reusable buffer. After
implementing this, the rate of memory consumption is far less
frightening, but still far too rapid compared to when the GC is enabled,
so clearly, *something* else is still doing way more allocations than
necessary.

A little more investigation revealed the culprit: a queue object
implemented with subarrays, and subarrays were being used as stacks by
appending with ~= and popping by decrementing .length. A dangerous
combination that causes excessive reallocations! After inserting
assumeSafeAppend() in the right place, the memory consumption rate
dropped to more reasonable levels. It's still visibly faster than when
GC is enabled, so probably there are other places with excessive
allocations, but this particular change stood out because after adding
assumeSafeAppend() to the code, the solver's performance *even with GC
enabled* improved by about 40%-50%. So this explains why calling
GC.disable() had such a pronounced effect before: the GC is too busy
cleaning up after unnecessarily allocation-hungry code!

The moral of the story is that code like the following should raise big
red warning flags:

	struct MyData(T) {
	    T[] data;
	    T pop() {
	        ...
	        data.length--; // <-- yikes
	        ...
	    }
	    void push(T t) {
	        ...
	        data ~= t; // <-- yikes
	        ...
	    }
	}

The easy fix (though not a complete one) is to insert a call to
assumeSafeAppend before `data ~= t`. That suppresses the majority of the
unnecessary reallocations, while leaving the necessary ones intact (e.g.
when the GC block runs out of space to store the array).
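
Concretely, the struct above with the fix applied (and the elided bits
filled in minimally, as a sketch):

	struct MyData(T) {
	    T[] data;
	    T pop() {
	        T t = data[$ - 1];
	        data.length--;           // still shrinks the slice...
	        return t;
	    }
	    void push(T t) {
	        data.assumeSafeAppend(); // ...but this re-arms in-place
	        data ~= t;               // appending, so ~= extends the
	    }                            // block instead of reallocating
	}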

A more complete fix, of course, is to use a linked-list of presized
blocks, so that the arrays don't keep getting moved around in memory
needlessly.

So while the GC does need to be improved, in this particular case a big
part of the problem is poorly-written user code (by yours truly :-P),
not so much the GC itself. ;-)


T

-- 
Skill without imagination is craftsmanship and gives us many useful objects 
such as wickerwork picnic baskets.  Imagination without skill gives us modern 
art. -- Tom Stoppard