Re: Optimization fun
08-Nov-2014 03:22, Walter Bright writes: On 11/7/2014 2:58 PM, Dmitry Olshansky wrote: That's the problem with profilers: they say what takes time but not why :) Often I find myself looking at memcpy at the top of the list, so obvious the textbook answer is to optimize memcpy ;) In contrast it should be read as you seem to do excessive copying of data. dmd's profiler will give you a tree showing who called each function how many times. This helps a lot with the why question. This information is very limited, especially in the case of recursive functions; it's still more of what instead of why. Why always takes a bit of research. Also I'd suggest using many different profilers. -- Dmitry Olshansky
Re: Optimization fun
On Sat, Nov 08, 2014 at 02:29:25AM +, Kiith-Sa via Digitalmars-d wrote: [...] Except for very specific cases, neither gprof nor DMD's profiler is good for profiling. If you're on Linux, you have perf, which works well with D and is way ahead of anything the DMD profiler will be able to do *after* man-years of further development. See (shameless advertisement): http://defenestrate.eu/_static/profiling-slides/index.html#22 (especially perf record/perf report) The slides mention not to use gprof, but never explain why. Also, the recommendations are kinda not what I'm looking for; I'm not looking so much for interactive performance (trying to optimize branch predictions, etc.), because I'm anticipating that in the future, most of the problems I'll be running will be I/O bound, because the size of the problems will *far* exceed available RAM. Mispredicted branches or L1/L2 cache misses are kinda irrelevant in this scenario. Or at least, they're not as important as they're made out to be, since what will matter more is how to minimize disk seeks and how to keep frequently-accessed cached disk pages in the CPU cache. So it's still somewhat relevant, just not as much as for other applications like interactive games. T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Re: Optimization fun
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote: 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.
Re: Optimization fun
On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote: A little more investigation revealed the culprit: a queue object implemented with subarrays, and subarrays were being used as stacks by appending with ~= and popping by decrementing .length. A dangerous combination that causes excessive reallocations! After inserting assumeSafeAppend() in the right place, the memory consumption rate dropped to more reasonable levels. It's still visibly faster than when GC is enabled, so probably there are other places with excessive allocations, but this particular change stood out because after adding assumeSafeAppend() to the code, the solver's performance *even with GC enabled* improved by about 40%-50%. So this explains why calling GC.disable() had such a pronounced effect before: the GC is too busy cleaning up after unnecessarily allocation-hungry code! In D1 this would have worked fine. A queue array wrapper may be a good candidate for Phobos that works just like a normal array but calls assumeSafeAppend whenever decrementing length. -Steve
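P.S. A minimal sketch of what such a wrapper might look like (hypothetical and untested, not actual Phobos code; the name and interface are made up for illustration):

    struct QueueArray(T)
    {
        private T[] impl;

        void opOpAssign(string op : "~")(T t) { impl ~= t; }

        ref T opIndex(size_t i) { return impl[i]; }

        @property size_t length() const { return impl.length; }

        // Shrinking .length re-arms in-place appending, so the next ~=
        // extends the same GC block instead of reallocating a new one.
        @property void length(size_t n)
        {
            immutable shrinking = n < impl.length;
            impl.length = n;
            if (shrinking)
                impl.assumeSafeAppend();
        }
    }

With this, `q.length = q.length - 1;` pops an element without forcing the next `q ~= x;` to reallocate.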
Re: Optimization fun
On 11/7/14 8:43 AM, Steven Schveighoffer wrote: On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote: A little more investigation revealed the culprit: a queue object implemented with subarrays, and subarrays were being used as stacks by appending with ~= and popping by decrementing .length. A dangerous combination that causes excessive reallocations! After inserting assumeSafeAppend() in the right place, the memory consumption rate dropped to more reasonable levels. It's still visibly faster than when GC is enabled, so probably there are other places with excessive allocations, but this particular change stood out because after adding assumeSafeAppend() to the code, the solver's performance *even with GC enabled* improved by about 40%-50%. So this explains why calling GC.disable() had such a pronounced effect before: the GC is too busy cleaning up after unnecessarily allocation-hungry code! In D1 this would have worked fine. A queue array wrapper may be a good candidate for Phobos that works just like a normal array but calls assumeSafeAppend whenever decrementing length. Also to note -- even with this, you will create garbage. As the array grows, it may have to relocate to a new block. The old block would be garbage. Once you get large enough, there is a chance that it will just extend into new pages. But if it ever has to reallocate a very large array, then you just created a very large piece of garbage that will not be cleaned up. I'm wondering if said queue array should have an option to forcefully delete the old array. In which case, we'd have to make sure the data cannot be accessed directly, or mark it as unsafe. -Steve
Re: Optimization fun
On Fri, Nov 07, 2014 at 09:28:56AM +, thedeemon via Digitalmars-d wrote: On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote: 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements. I realize that, it's just that even after accounting for that, the memory consumption rate is much higher than expected, and sure enough further investigation revealed unnecessary GC load caused by reallocating arrays where (re)using an existing one would suffice. In the case of the function that used to return an array, which would allocate a new array every call (and there are hundreds of thousands of calls for this particular test case), after implementing buffer reuse, even though it does need to reallocate the buffer sometimes when the array runs out of space, it only reallocates 3-4 times for the entire run. I'm not sure what the overhead for the hash table would be, but I'm expecting it shouldn't be more than, say, 50-100 reallocations of the bucket array (I didn't measure this). So it shouldn't add up to *that* much. T -- There is no gravity. The earth sucks.
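P.S. The buffer-reuse pattern mentioned above, as a self-contained toy sketch (the real function isn't shown here, so the names and the loop body are made up):

    // Appends results into a caller-provided buffer instead of
    // allocating a fresh array on every call.
    int[] collectResults(int n, ref int[] buf)
    {
        buf.length = 0;         // reset the length but keep the capacity
        buf.assumeSafeAppend(); // let ~= reuse the old block in place
        foreach (i; 1 .. 10)
            buf ~= n * i;
        return buf;             // valid only until the next call
    }

    void main()
    {
        int[] buf;
        size_t total;
        foreach (n; 0 .. 100_000)
        {
            auto r = collectResults(n, buf); // buf reallocates only a few times
            total += r.length;
        }
    }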
Re: Optimization fun
On Fri, Nov 07, 2014 at 08:46:29AM -0500, Steven Schveighoffer via Digitalmars-d wrote: On 11/7/14 8:43 AM, Steven Schveighoffer wrote: On 11/7/14 1:05 AM, H. S. Teoh via Digitalmars-d wrote: A little more investigation revealed the culprit: a queue object implemented with subarrays, and subarrays were being used as stacks by appending with ~= and popping by decrementing .length. A dangerous combination that causes excessive reallocations! After inserting assumeSafeAppend() in the right place, the memory consumption rate dropped to more reasonable levels. It's still visibly faster than when GC is enabled, so probably there are other places with excessive allocations, but this particular change stood out because after adding assumeSafeAppend() to the code, the solver's performance *even with GC enabled* improved by about 40%-50%. So this explains why calling GC.disable() had such a pronounced effect before: the GC is too busy cleaning up after unnecessarily allocation-hungry code! In D1 this would have worked fine. A queue array wrapper may be a good candidate for Phobos that works just like a normal array but calls assumeSafeAppend whenever decrementing length. Also to note -- even with this, you will create garbage. As the array grows, it may have to relocate to a new block. The old block would be garbage. Once you get large enough, there is a chance that it will just extend into new pages. But if it ever has to reallocate a very large array, then you just created a very large piece of garbage that will not be cleaned up. Good point. Maybe that's where the remaining garbage is coming from: remnants of reallocated arrays. Some of the subarrays I'm using do grow pretty big, so things could add up. I'm wondering if said queue array should have an option to forcefully delete the old array. In which case, we'd have to make sure the data cannot be accessed directly, or mark it as unsafe. [...] Well, if we're going to manually manage the array's memory, then we're back in malloc/free territory along with all of their pitfalls. :-) So I wouldn't sweat it too much -- I'd make the queue a non-copyable reference type (so there would be no aliasing of the queue array) and leave it at that. On another note, I wonder if reducing the frequency of GC collections might help keep memory under control while minimizing the performance hits. As a first stab, I could try to manually trigger a collection every, say, 100,000 iterations of the main loop, which would be about 10 or so collections per run, which should keep memory usage under control while avoiding excessively frequent collections that degrade performance. T -- I speak better English than this villain Bush -- Mohammed Saeed al-Sahaf, Iraqi Minister of Information
Re: Optimization fun
On Fri, Nov 07, 2014 at 09:30:47AM -0800, H. S. Teoh via Digitalmars-d wrote: [...] On another note, I wonder if reducing the frequency of GC collections might help keep memory under control while minimizing the performance hits. As a first stab, I could try to manually trigger a collection every, say, 100,000 iterations of the main loop, which would be about 10 or so collections per run, which should keep memory usage under control while avoiding excessively frequent collections that degrade performance. [...] OK, I did a little test where I would disable the GC at the beginning of the program, then trigger a collection every 250,000 iterations. As a rough visual aid, I print a message before and after the collection so that I can gauge how long each collection is taking. It appears that as the program progresses, collections become longer -- expected, because the hash table is growing rapidly, so the marking phase probably would take longer (I didn't measure this in detail though). Overall, a puzzle that took ~7.0 seconds to solve now takes ~7.6 seconds, which is about 8% performance degradation. That's pretty high, considering that we already know that most of the new allocations between collections are live. Is there a way to tell the GC to ignore the hash table buckets for collection purposes? It would seem that most of the work is being wasted there since we already know it will be live until the end of the program. Another alternative is to manually call GC.free on reallocated arrays in the two places I've identified, and turn off collections completely. This could be a better approach in this case, since we already know where the reallocations happen. Another GC-related question: if the heap grows too large to fit in RAM, and a GC collection cycle is triggered, will this cause thrashing? -- since the GC will have to scan through all the blocks, causing *all* of them to be paged in and paged out again. Of course, this little side-exercise is more of a curiosity than anything else, since I'm planning to replace the hash table with something else that manually manages memory (and page to disk as necessary) anyway, so any optimizations along these lines will be moot. But it's interesting to explore how far one can go with the current system. T -- It is not the employer who pays the wages. Employers only handle the money. It is the customer who pays the wages. -- Henry Ford
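P.S. The experiment boils down to something like this (a sketch; the loop body is just a stand-in for the real solver):

    import core.memory : GC;

    void main()
    {
        GC.disable();            // suppress automatic collections
        scope(exit) GC.enable();

        int[][] visited;         // stand-in for the growing hash of states
        foreach (i; 1 .. 1_000_001)
        {
            visited ~= new int[16];  // stand-in per-iteration allocation
            if (i % 250_000 == 0)
                GC.collect();        // explicit collection every 250k iterations
        }
    }

Note that GC.collect() still runs while automatic collections are disabled, which is what makes this scheme work.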
Re: Optimization fun
On Fri, Nov 07, 2014 at 10:01:41AM -0800, H. S. Teoh via Digitalmars-d wrote: [...] OK, I did a little test where I would disable the GC at the beginning of the program, then trigger a collection every 250,000 iterations. As a rough visual aid, I print a message before and after the collection so that I can gauge how long each collection is taking. It appears that as the program progresses, collections become longer -- expected, because the hash table is growing rapidly, so the marking phase probably would take longer (I didn't measure this in detail though). Overall, a puzzle that took ~7.0 seconds to solve now takes ~7.6 seconds, which is about 8% performance degradation. That's pretty high, considering that we already know that most of the new allocations between collections are live. [...] P.S. Note, however, that with the GC fully enabled, the average running time is about 9.8 seconds (the 7.0s figure is when the GC is completely disabled and no collections are run, so memory is leaking). That's still a ~25% improvement over the normal GC behaviour, while preventing memory leaks, even if 8% is a little high for a scenario where we already know most of the memory is still live. (One could see an argument for a generational collector here!) So arguably, the frequency of the GC's collections could be tweaked, as Steven mentioned. We could be getting ~25% performance improvement for free! (But this is of course highly dependent on system-specific and program-specific measurements, so I don't know if this will actually improve performance in general. Still, it's an interesting area to explore in terms of improving the performance of the current GC. Where's Andrei with his nice shiny yet-to-be-materialized new GC when we need him??! :-P) T -- 640K ought to be enough -- Bill G. (allegedly), 1984. The Internet is not a primary goal for PC usage -- Bill G., 1995. Linux has no impact on Microsoft's strategy -- Bill G., 1999.
Re: Optimization fun
On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote: So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things: 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time. 2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had: auto offset = currentBoard.countUntil(ch); which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to: auto offset = currentBoard.representation.countUntil(ch); gave an additional 20%-30% performance boost. 3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop: auto ptr = current.ptr; auto c = current.length; while (c > 0 && *ptr != m.id) ptr++; cert.headOffset = ptr - current.ptr; This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet?? Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time. On that note, at one point I was wondering why gdc -O3 didn't generate a rep scasb for the inner loop instead of manually incrementing the loop pointer; but a little research revealed that I'm about 6-7 years out of date: since about 2008 gcc's optimizer has not used rep scasb because on modern hardware it has atrocious performance -- much worse than a manually-written C loop! So much for built-in string processing that used to be touted in the old days. Yet another proof of the rule that overly-complex CPU instruction sets rarely actually perform better. Anyway, after everything is said and done, a puzzle that used to take about 20 seconds to solve now only takes about 6 seconds. W00t! T Hi, Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D. Thanks!
Re: Optimization fun
On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d wrote: [...] Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D. [...] I haven't posted the code anywhere yet, but I could certainly send you a copy if you want. However, the test input I'm using is not public, so you'll have to craft your own input files for testing purposes. The public input files I do have are too simple to pose a significant challenge to the solver (plus, they default to BFS instead of A*, so you might get different results from what I described here). T -- Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say Gosh! / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker
Re: Optimization fun
On 11/7/14, 4:16 PM, H. S. Teoh via Digitalmars-d wrote: On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d wrote: [...] Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D. [...] I haven't posted the code anywhere yet, but I could certainly send you a copy if you want. However, the test input I'm using is not public, so you'll have to craft your own input files for testing purposes. The public input files I do have are too simple to pose a significant challenge to the solver (plus, they default to BFS instead of A*, so you might get different results from what I described here). T Then I don't think I'll be able to do any useful benchmark/profile with that. Thanks anyway :-)
Re: Optimization fun
07-Nov-2014 09:05, H. S. Teoh via Digitalmars-d writes: On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote: [...] 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time. [...] Yet more GC drama unfolded tonight: while testing my new GC-disabling option on several other puzzles, one instance ate up my PC's RAM at an astonishing rate, causing the system to almost freeze up from the ensuing thrashing. A little investigation revealed that there are several instances of excessive GC load caused by unnecessary (re)allocations. That's the problem with profilers: they say what takes time but not why :) Often I find myself looking at memcpy at the top of the list, so obvious the textbook answer is to optimize memcpy ;) In contrast it should be read as you seem to do excessive copying of data. The moral of the story is that code like the following should raise big red warning flags:

    struct MyData(T)
    {
        T[] data;

        T pop()
        {
            ...
            data.length--; // <-- yikes
            ...
        }

        void push(T t)
        {
            ...
            data ~= t; // <-- yikes
            ...
        }
    }

A performance bug well known in the inner circles. Sadly, most D programmers are bound to hit it once in a while. Memory safety by default has its costs. Though I'd just call mmap, far more predictable than these pesky memory allocators, plus no need to reallocate if the reserved range is large enough. -- Dmitry Olshansky
Re: Optimization fun
On Sat, Nov 08, 2014 at 01:58:31AM +0300, Dmitry Olshansky via Digitalmars-d wrote: 07-Nov-2014 09:05, H. S. Teoh via Digitalmars-d пишет: [...] Yet more GC drama unfolded tonight: while testing my new GC-disabling option on several other puzzles, one instance ate up my PC's RAM at an astonishing rate, causing the system to almost freeze up from the ensuing thrashing. A little investigation revealed that there are several instances of excessive GC load caused by unnecessary (re)allocations. That's the problem with profilers: they say what takes time but not why :) True! But it's better than the stereotypical C veteran's approach of I know from 50 years' experience (aka personal bias) that X is faster than Y because it's more terse, so I'm gonna optimize all my code into unreadable code from the get-go even if 99% of it isn't the real hotspot anyway. Often I find myself looking at memcpy at the top of the list, so obvious the textbook answer is to optimize memcpy ;) In contrast it should be read as you seem to do excessive copying of data. [...] Well, when I saw countUntil at the top of the list, it wasn't immediately obvious *why* it was at the top of the list. In the grand scheme of things, it's just a small cog in a large machinery filled with many other things that could easily be considered far better candidates for performance bottlenecks. And while optimizing countUntil itself did yield significant improvement, a deeper investigation led to the realization that the underlying logic of the code was causing it to call countUntil more often than it needs to. Fixing this issue drastically reduced the number of calls to countUntil, which yielded no small overall improvement. In any case, the old optimization adage applies: you should optimize the underlying algorithms (i.e., the big-O's) before you spend time optimizing the low-level primitives (the scalar factors). No matter how much you optimize the heck out of the scalar factor, an O(N^2) algorithm is still an O(N^2) algorithm. As long as the algorithm remains O(N^2), you aren't going to get much further. T -- Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.
Re: Optimization fun
On 11/7/2014 2:58 PM, Dmitry Olshansky wrote: That's the problem with profilers: they say what takes time but not why :) Often I find myself looking at memcpy at the top of the list, so obvious the textbook answer is to optimize memcpy ;) In contrast it should be read as you seem to do excessive copying of data. dmd's profiler will give you a tree showing who called each function how many times. This helps a lot with the why question.
Re: Optimization fun
On Fri, Nov 07, 2014 at 04:22:49PM -0800, Walter Bright via Digitalmars-d wrote: On 11/7/2014 2:58 PM, Dmitry Olshansky wrote: That's the problem with profilers: they say what takes time but not why :) Often I find myself looking at memcpy at the top of the list, so obvious the textbook answer is to optimize memcpy ;) In contrast it should be read as you seem to do excessive copying of data. dmd's profiler will give you a tree showing who called each function how many times. This helps a lot with the why question. gprof also shows that in the section section of its output. But speaking of which, I found dmd -profile's output in trace.log a little difficult to understand because of the lack of self-documenting headers in the first section. gprof, for example, produces nicely-labelled output that describes what the data means. Would you accept a PR along these lines? (Or is there a post-processing step that I'm missing?) Also, in the second section, I'm getting some negative numbers for the call times, which seem to indicate integer overflow? This basically destroys the usefulness of dmd -profile for my test cases, since most of the interesting cases for me are those with long running times, which for sure will run into integer overflow at the sample rate that dmd -profile is using, causing the second section to be essentially randomly sorted. T -- He who does not appreciate the beauty of language is not worthy to bemoan its flaws.
Re: Optimization fun
On Fri, Nov 07, 2014 at 04:41:27PM -0800, H. S. Teoh via Digitalmars-d wrote: [...] gprof also shows that in the section section of its output. [...] ^^^ Argh, I mean *second* section (which, incidentally, corresponds with the first section of trace.log in the output generated by dmd -profile). Dyslexic today. :-( T -- Study gravitation, it's a field with a lot of potential.
Re: Optimization fun
On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote: On Fri, Nov 07, 2014 at 04:22:49PM -0800, Walter Bright via Digitalmars-d wrote: On 11/7/2014 2:58 PM, Dmitry Olshansky wrote: That's the problem with profilers: they say what takes time but not why :) Often I find myself looking at memcpy at the top of the list, so obvious the textbook answer is to optimize memcpy ;) In contrast it should be read as you seem to do excessive copying of data. dmd's profiler will give you a tree showing who called each function how many times. This helps a lot with the why question. gprof also shows that in the section section of its output. But speaking of which, I found dmd -profile's output in trace.log a little difficult to understand because of the lack of self-documenting headers in the first section. gprof, for example, produces nicely-labelled output that describes what the data means. Would you accept a PR along these lines? I don't really know what you have in mind, so I'll have to say it depends. Keep in mind that subsequent runs of the profiler parse the previous output file and add it to the results. Also, in the second section, I'm getting some negative numbers for the call times, which seem to indicate integer overflow? This basically destroys the usefulness of dmd -profile for my test cases, since most of the interesting cases for me are those with long running times, which for sure will run into integer overflow at the sample rate that dmd -profile is using, causing the second section to be essentially randomly sorted. Yeah, a negative number is likely an overflow. Reduce the test case!
Re: Optimization fun
On Fri, Nov 07, 2014 at 05:31:44PM -0800, Walter Bright via Digitalmars-d wrote: On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote: [...] But speaking of which, I found dmd -profile's output in trace.log a little difficult to understand because of the lack of self-documenting headers in the first section. gprof, for example, produces nicely-labelled output that describes what the data means. Would you accept a PR along these lines? I don't really know what you have in mind, so I'll have to say it depends. Keep in mind that subsequent runs of the profiler parse the previous output file and add it to the results. Just add a simple header that describes what each column means. Also, in the second section, I'm getting some negative numbers for the call times, which seem to indicate integer overflow? This basically destroys the usefulness of dmd -profile for my test cases, since most of the interesting cases for me are those with long running times, which for sure will run into integer overflow at the sample rate that dmd -profile is using, causing the second section to be essentially randomly sorted. Yeah, a negative number is likely an overflow. Reduce the test case! Unfortunately, I can't. The behaviour I'm trying to profile is the long-term running case, because there's a whole bunch of setup at the beginning that initially takes a while, but eventually it's the long-running part that dominates the runtime behaviour. Reducing the test case will only increase the initial noise, which I'm not interested in, and reduce the running time of the main loop of interest. Besides, what I'm trying to optimize for is the large, complex case; the simple cases already run fast enough that the performance characteristics are not that interesting to me. It's the long-term part that's interesting because that's when the GC starts kicking in and pressure on system RAM starts to increase. I'm surprised that dmd's profiler can't even handle something that only runs for 7-8 seconds or so! gprof certainly has no such limitation. Is it relatively simple to make dmd -profile use larger integer widths for profiling? If not, I'm afraid I'm just gonna have to stick with gdc/gprof instead. T -- Your inconsistency is the only consistent thing about you! -- KD
Re: Optimization fun
On 11/7/2014 5:51 PM, H. S. Teoh via Digitalmars-d wrote: I'm surprised that dmd's profiler can't even handle something that only runs for 7-8 seconds or so! It's based on a design I wrote decades ago, when machines were a lot slower. Is it relatively simple to make dmd -profile use larger integer widths for profiling? I don't know. Haven't looked at it for a while.
Re: Optimization fun
On Saturday, 8 November 2014 at 01:53:33 UTC, H. S. Teoh via Digitalmars-d wrote: On Fri, Nov 07, 2014 at 05:31:44PM -0800, Walter Bright via Digitalmars-d wrote: On 11/7/2014 4:41 PM, H. S. Teoh via Digitalmars-d wrote: [...] But speaking of which, I found dmd -profile's output in trace.log a little difficult to understand because of the lack of self-documenting headers in the first section. gprof, for example, produces nicely-labelled output that describes what the data means. Would you accept a PR along these lines? I don't really know what you have in mind, so I'll have to say it depends. Keep in mind that subsequent runs of the profiler parse the previous output file and add it to the results. Just add a simple header that describes what each column means. Also, in the second section, I'm getting some negative numbers for the call times, which seem to indicate integer overflow? This basically destroys the usefulness of dmd -profile for my test cases, since most of the interesting cases for me are those with long running times, which for sure will run into integer overflow at the sample rate that dmd -profile is using, causing the second section to be essentially randomly sorted. Yeah, a negative number is likely an overflow. Reduce the test case! Unfortunately, I can't. The behaviour I'm trying to profile is the long-term running case, because there's a whole bunch of setup at the beginning that initially takes a while, but eventually it's the long-running part that dominates the runtime behaviour. Reducing the test case will only increase the initial noise, which I'm not interested in, and reduce the running time of the main loop of interest. Besides, what I'm trying to optimize for is the large, complex case; the simple cases already run fast enough that the performance characteristics are not that interesting to me. It's the long-term part that's interesting because that's when the GC starts kicking in and pressure on system RAM starts to increase. I'm surprised that dmd's profiler can't even handle something that only runs for 7-8 seconds or so! gprof certainly has no such limitation. Is it relatively simple to make dmd -profile use larger integer widths for profiling? If not, I'm afraid I'm just gonna have to stick with gdc/gprof instead. T Except for very specific cases, neither gprof nor DMD's profiler is good for profiling. If you're on Linux, you have perf, which works well with D and is way ahead of anything the DMD profiler will be able to do *after* man-years of further development. See (shameless advertisement): http://defenestrate.eu/_static/profiling-slides/index.html#22 (especially perf record/perf report)
Optimization fun
So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time.

2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had:

    auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to:

    auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.

3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop:

    auto ptr = current.ptr;
    auto c = current.length;
    while (c > 0 && *ptr != m.id)
        ptr++;
    cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet?? Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time.

On that note, at one point I was wondering why gdc -O3 didn't generate a rep scasb for the inner loop instead of manually incrementing the loop pointer; but a little research revealed that I'm about 6-7 years out of date: since about 2008 gcc's optimizer has not used rep scasb because on modern hardware it has atrocious performance -- much worse than a manually-written C loop! So much for built-in string processing that used to be touted in the old days. Yet another proof of the rule that overly-complex CPU instruction sets rarely actually perform better.

Anyway, after everything is said and done, a puzzle that used to take about 20 seconds to solve now only takes about 6 seconds. W00t!

T -- Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
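P.S. For reference, wiring up such a GC-disabling switch only takes a few lines (a sketch; the flag name is invented):

    import core.memory : GC;
    import std.getopt : getopt;

    void main(string[] args)
    {
        bool noGC;
        getopt(args, "no-gc", &noGC); // hypothetical flag name
        if (noGC)
            GC.disable(); // allocations still work; collections no longer run
        // ... run the solver ...
    }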
Re: Optimization fun
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote: [...] auto ptr = current.ptr; auto c = current.length; while (c > 0 && *ptr != m.id) ptr++; cert.headOffset = ptr - current.ptr; This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet?? [...] Argh... the reason (c > 0) had no check was because the code is wrong: c is never decremented. :-( Replacing (c > 0) with (c-- > 0) reduced the performance improvement to about 3% instead of 4%-5%. Oh well. :-/ mouth.open(); mouth.insert(foot); T -- I am a consultant. My job is to make your job redundant. -- Mr Tom
Re: Optimization fun
On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote: 2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had: auto offset = currentBoard.countUntil(ch); which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to: auto offset = currentBoard.representation.countUntil(ch); gave an additional 20%-30% performance boost. I've argued this point for over a year, apparently without convincing anyone :-( BTW, use .byChar instead of .representation, because the former leaves it typed as 'char' and the latter converts it to 'ubyte'. 3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop: auto ptr = current.ptr; auto c = current.length; while (c > 0 && *ptr != m.id) ptr++; cert.headOffset = ptr - current.ptr; This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet?? Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time. Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)
Re: Optimization fun
On 11/6/2014 3:13 PM, H. S. Teoh via Digitalmars-d wrote: mouth.open(); mouth.insert(foot); Now you've done it. You must pull the Stone of Shame! https://pbs.twimg.com/media/BrENFq3CMAEqFUL.jpg
Re: Optimization fun
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d wrote: So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things: 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time. We've talked before about keeping statistics of how much memory has been reclaimed by collections, and use that to influence whether a collection should be run before going to the OS for more. Something like this should definitely get a bugzilla ticket. For now, you might want to consider guessing at how much memory you'll need and calling GC.reserve() as well. This should improve the performance of your app even further. 2) Auto-decoding is evil Yep, it's the worst. Most string operations don't even need auto-decoding to work properly. I honestly have no idea why we have this feature.
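Regarding GC.reserve(): the call is a one-liner (a sketch; the size below is a made-up guess at what the solver needs):

    import core.memory : GC;

    void main()
    {
        // Pre-allocate pool memory up front so the GC heap doesn't have
        // to grow (and trigger collections) incrementally during the run.
        GC.reserve(512 * 1024 * 1024); // 512 MiB, an arbitrary estimate
        // ... run the solver ...
    }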
Re: Optimization fun
On Thu, Nov 06, 2014 at 03:15:04PM -0800, Walter Bright via Digitalmars-d wrote: On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote: 2) Auto-decoding is evil: the solver currently uses a naïve char[] representation for the puzzle board, and uses countUntil() extensively to locate things on the board. The original code had: auto offset = currentBoard.countUntil(ch); which, of course, triggers autodecoding (even though there is actually no reason to do so: ch is a char, and therefore countUntil could've used strchr instead). Simply changing that to: auto offset = currentBoard.representation.countUntil(ch); gave an additional 20%-30% performance boost. I've argued this point for over a year, apparently without convincing anyone :-( I dunno, my impression is that some people agree with it, but consider that the cost of changing it now is a little too prohibitive due to the extensive code breakage it will cause. But the primary obstacle is that Andrei is not convinced, so he's going to veto any change in this direction regardless. You two will have to tussle it out to resolve this. ;-) BTW, use .byChar instead of .representation, because the former leaves it typed as 'char' and the latter converts it to 'ubyte'. I was going to use .byChar, but since I'm doing performance testing on gdc, and (my version of) gdc hasn't caught up to the latest Phobos yet, I have to settle for .representation for now. For my purposes it's harmless, since I'm not actually doing anything with the ubyte representation once countUntil is done; the return value is used to index the original char[]. 3) However, countUntil remains at the top of the profiler's list of hotspots. DMD's optimizer was rather disappointing here, so I looked at gdc's output instead. Digging into the disassembly, I saw that it was using a foreach loop over the char[], but for some reason even gdc -O3 failed to simplify that loop significantly (I'm not sure why -- maybe what threw off the optimizer is the array indexing + length check operation, where in C one would normally just bump the pointer instead?). Eventually, I tried writing a manual loop: auto ptr = current.ptr; auto c = current.length; while (c > 0 && *ptr != m.id) ptr++; cert.headOffset = ptr - current.ptr; This pushed gdc far enough in the right direction that it finally produced a 3-instruction inner loop. Strangely enough, the assembly had no check for (c > 0)... could it be because there is an assert following the loop claiming that that will never happen, therefore gdc elided the check? I thought Walter hasn't gotten around to implementing optimizations based on assert yet?? Anyway, with this optimization, I managed to shave off another 3%-5% of the total running time. Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-) LOL... I wasn't sure whether said optimizations were going to be in the front end, or in the backend only. It might not be a bad idea to put it in the front end where applicable, then all compilers will benefit from it. T -- If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Re: Optimization fun
On Thursday, 6 November 2014 at 23:30:56 UTC, Walter Bright wrote: On 11/6/2014 3:13 PM, H. S. Teoh via Digitalmars-d wrote: mouth.open(); mouth.insert(foot); Now you've done it. You must pull the Stone of Shame! https://pbs.twimg.com/media/BrENFq3CMAEqFUL.jpg The Shame Cube is far more efficient at inflicting shame. http://stream1.gifsoup.com/view3/2323130/shame-cube-o.gif
Re: Optimization fun
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote: So today, I was playing around with profiling and optimizing my sliding block puzzle solver, and found some interesting things: [...]

Oh, I forgot to mention something else that might be of interest to people: in the course of optimizing the performance of canFind(), one of the techniques I used was to reduce the frequency with which it's called. One such situation was the duplication between the canMove() method, which checks whether a prospective move is valid, and move(), which performs the actual move (update the puzzle board). In more general terms, the same pattern applies to data structures that have a check method for evaluating whether a particular operation is legal or not, and another method for actually performing the operation. For example, when updating a hash table or binary tree you might do a find to see if an entry already exists, before you insert a new entry. While check() followed by insert() makes sense logically (first you check, then you proceed if the check is OK), it performs poorly in code, because quite often there is a fair amount of overlap in the work done by either method. The check, for example, may have found the node in which the insert will take place, but since it has to return to the caller first, this information is lost and insert() has to recompute something that it should already have known.

There are a few common solutions to this problem:

1) Have a checkAndInsert() method that both checks and performs the insertion if the check passes. While this solves the problem, it's ugly because it conflates two logically distinct operations into one, thus breaking the principle of single responsibility. A variation on (1) is to pass a flag (ugh) to insert() to do a check before inserting. It's essentially the same solution with the same drawbacks.

2) check() returns a node (or iterator, or range, whatever) instead of a bool, which is then passed to insert() so it doesn't have to traverse the structure again. This is a slightly better solution, but the disadvantage is that you expose the implementation details of the structure, and besides, why should a method named check return an iterator? It's also unable to capture other repeated computations effectively; e.g., check() may have computed things like the balance factor of the tree at that point, which insert() can profitably reuse, but an iterator/range doesn't (and shouldn't!) capture such a kind of information.

3) check() takes an additional ref parameter that can be used to store extra information to be passed to insert(), then the caller passes this extra information when it calls insert(). This is also ugly, because it pollutes user code with implementation details. This can be restricted to only private use by other methods, so that we don't break encapsulation, but then user code won't be able to benefit from it.

In the end, the idea occurred to me that the reason insert() can reuse certain things check() has already computed, is because those things were computed as part of the process of *verifying* the validity of a certain operation. Much like verifying the validity of a piece of data with a checksum, the checksum computed by the verification function is a proof, or certificate, that the data is valid. So it occurred to me that check() should return not a bool, but an opaque *certificate* object that captures the information computed by check() in the process of validating the user-proposed operation.
This certificate object can then be inspected by the caller to see if the check passed (a default, invalid certificate is returned if check() fails), and it can also capture additional information that insert() can use afterwards. Then insert() itself will have two overloads: one that takes the normal parameters, which will (re)compute whatever information is necessary to perform the operation, and another that takes a certificate that certifies that the given operation has been previously validated, and therefore insert() can skip recomputation of things already computed by check() -- which are captured in the certificate object. Furthermore, being a certificate, it also contains information about the original request (otherwise you could illegally pair an invalid request with the certificate of a different, valid request and cause insert() to malfunction).

So my eventual solution was to change the canMove() method's signature from:

    bool canMove(Move m);

to:

    MoveCert canMove(Move m);

And then the overloads of move() are:

    void move(Move m)
    {
        // Unvalidated overload: will compute the values needed
        // by the second overload of move manually since it's
        // not already computed.
        MoveCert cert = ... ;

        // The actual implementation is the second overload; now
        // that we have computed the necessary info to proceed,
        // just forward to it:
        move(cert);
    }

    void move(MoveCert cert)
    {
        ... // actual implementation, reusing what canMove() computed
    }
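A self-contained toy illustrating the same pattern (all names here are hypothetical; a real MoveCert would capture solver-specific state, and it records the original request as described above):

    struct InsertCert
    {
        size_t pos = size_t.max; // insertion point found by check()
        int key;                 // the original request, so certs can't be misused
        @property bool valid() const { return pos != size_t.max; }
    }

    struct SortedList
    {
        int[] data;

        // Validates the operation and captures the work done along the way.
        InsertCert check(int x)
        {
            size_t i;
            while (i < data.length && data[i] < x) ++i;
            if (i < data.length && data[i] == x)
                return InsertCert.init; // duplicate: invalid certificate
            return InsertCert(i, x);
        }

        // Certified overload: skips the search check() already did.
        void insert(InsertCert cert)
        {
            assert(cert.valid);
            data = data[0 .. cert.pos] ~ cert.key ~ data[cert.pos .. $];
        }

        // Unvalidated overload: computes the certificate itself, then forwards.
        void insert(int x)
        {
            auto cert = check(x);
            if (cert.valid) insert(cert);
        }
    }

    void main()
    {
        SortedList s;
        auto cert = s.check(5);
        if (cert.valid)
            s.insert(cert); // no repeated traversal
    }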
Re: Optimization fun
On Thu, Nov 06, 2014 at 02:58:16PM -0800, H. S. Teoh via Digitalmars-d wrote: [...] 1) The GC could use some serious improvement: it just so happens that the solver's algorithm only ever needs to allocate memory, never release it (it keeps a hash of visited states that only grows, never shrinks). The profiler indicated that one of the hotspots is the GC. So I added an option to the program to call GC.disable() if a command line option is given. The result is a 40%-50% reduction in running time. [...] Yet more GC drama unfolded tonight: while testing my new GC-disabling option on several other puzzles, one instance ate up my PC's RAM at an astonishing rate, causing the system to almost freeze up from the ensuing thrashing. A little investigation revealed that there are several instances of excessive GC load caused by unnecessary (re)allocations. One case was a function that constructs and returns a new array each call -- unnecessary because after the resulting array is traversed, it is discarded. Perfect candidate for a reusable buffer. After implementing this, the rate of memory consumption is far less frightening, but still far too rapid compared to when the GC is enabled, so clearly, *something* else is still doing way more allocations than necessary. A little more investigation revealed the culprit: a queue object implemented with subarrays, and subarrays were being used as stacks by appending with ~= and popping by decrementing .length. A dangerous combination that causes excessive reallocations! After inserting assumeSafeAppend() in the right place, the memory consumption rate dropped to more reasonable levels. It's still visibly faster than when GC is enabled, so probably there are other places with excessive allocations, but this particular change stood out because after adding assumeSafeAppend() to the code, the solver's performance *even with GC enabled* improved by about 40%-50%. So this explains why calling GC.disable() had such a pronounced effect before: the GC is too busy cleaning up after unnecessarily allocation-hungry code! The moral of the story is that code like the following should raise big red warning flags:

    struct MyData(T)
    {
        T[] data;

        T pop()
        {
            ...
            data.length--; // <-- yikes
            ...
        }

        void push(T t)
        {
            ...
            data ~= t; // <-- yikes
            ...
        }
    }

The easy fix (though not a complete one) is to insert a call to assumeSafeAppend before `data ~= t`. That suppresses the majority of the unnecessary reallocations, while leaving the necessary ones intact (e.g. when the GC block runs out of space to store the array). A more complete fix, of course, is to use a linked-list of presized blocks, so that the arrays don't keep getting moved around in memory needlessly. So while the GC does need to be improved, in this particular case a big part of the problem is poorly-written user code (by yours truly :-P), not so much the GC itself. ;-) T -- Skill without imagination is craftsmanship and gives us many useful objects such as wickerwork picnic baskets. Imagination without skill gives us modern art. -- Tom Stoppard
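P.S. A sketch of the more complete fix, the linked list of presized blocks (hypothetical code; the block size is arbitrary). Blocks never move once allocated, so pushing never copies old elements and popping never poisons future appends:

    struct BlockStack(T, size_t blockSize = 1024)
    {
        private static struct Block
        {
            T[blockSize] items;
            Block* prev;
        }

        private Block* top;  // most recently allocated block
        private size_t used; // slots used in *top

        @property bool empty() const
        {
            return top is null || (used == 0 && top.prev is null);
        }

        void push(T t)
        {
            if (top is null || used == blockSize)
            {
                auto b = new Block; // one allocation per blockSize pushes
                b.prev = top;
                top = b;
                used = 0;
            }
            top.items[used++] = t;
        }

        T pop()
        {
            assert(!empty);
            if (used == 0)
            {
                top = top.prev; // step back into the previous block
                used = blockSize;
            }
            return top.items[--used];
        }
    }

    void main()
    {
        BlockStack!int stack;
        foreach (i; 0 .. 3000)
            stack.push(i);
        while (!stack.empty)
            stack.pop();
    }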