[Issue 5623] Slow GC with large heaps

2011-08-27 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


David Simcha dsim...@yahoo.com changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED


--- Comment #17 from David Simcha dsim...@yahoo.com 2011-08-27 07:29:44 PDT 
---
I'm marking this as resolved.  My pull request was merged a long time ago and
Steve's issue is probably related to false pointers.



[Issue 5623] Slow GC with large heaps

2011-02-24 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #15 from David Simcha dsim...@yahoo.com 2011-02-24 17:56:20 PST 
---
https://github.com/dsimcha/druntime/wiki/Druntime-GC-Optimization-Fork



[Issue 5623] Slow GC with large heaps

2011-02-24 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #16 from Steven Schveighoffer schvei...@yahoo.com 2011-02-24 
18:45:01 PST ---
(In reply to comment #15)
 https://github.com/dsimcha/druntime/wiki/Druntime-GC-Optimization-Fork

From that wiki page:

Also note that a Tree2 benchmark also exists, but it seems to run in either 12
or 0 seconds, randomly, no matter what patches are applied, for reasons I don't
understand.

This pertains to bug 5650.

I have seen the same anomaly (though mine must be on slower hardware; it varies
between 20+ seconds and 0.9 seconds).

My theory is that false pointers are keeping the elements in memory, so the GC
never frees them.  It is *definitely* the GC's fullCollect that is causing the
slowdown, because while debugging (and printing out every 100 loops), you can
ctrl-c to pause, and it's always in the collect cycle.

Basically, the program is so deterministic that the only thing I can think of
that changes between good and bad runs is the address space given to the heap
by the OS.

I sort of tested this theory by adding this line to the code:

writeln((new int[1]).ptr);

Here are the results of running it a bunch of times (each row shows the printed
address followed by the real/user/sys times):

Address   real        user        sys
112E40    0m26.723s   0m26.554s   0m0.052s
A50E40    0m0.911s    0m0.908s    0m0.000s
26FE40    0m23.852s   0m23.737s   0m0.096s
112E40    0m20.139s   0m20.065s   0m0.040s
58BE40    0m19.932s   0m19.841s   0m0.080s
EBDE40    0m0.897s    0m0.880s    0m0.012s
724E40    0m25.801s   0m25.762s   0m0.024s
3F2E40    0m0.907s    0m0.904s    0m0.000s
AC9E40    0m0.891s    0m0.884s    0m0.000s
DA4E40    0m0.906s    0m0.888s    0m0.016s
26FE40    0m29.869s   0m29.770s   0m0.084s
799E40    0m0.900s    0m0.896s    0m0.000s
58DE40    0m39.999s   0m39.802s   0m0.152s
138E40    0m34.000s   0m33.906s   0m0.032s
65CE40    0m19.246s   0m19.201s   0m0.032s
1B0E40    0m28.394s   0m28.350s   0m0.028s
D62E40    0m0.910s    0m0.900s    0m0.008s
AB6E40    0m0.904s    0m0.904s    0m0.000s
26FE40    0m38.978s   0m38.834s   0m0.124s
367E40    0m27.100s   0m27.010s   0m0.076s
9DEE40    0m0.899s    0m0.888s    0m0.004s
112E40    0m40.536s   0m40.419s   0m0.088s
401E40    0m0.901s    0m0.896s    0m0.000s
A18E40    0m0.911s    0m0.900s    0m0.004s
7A1E40    0m0.908s    0m0.904s    0m0.004s
112E40    0m26.441s   0m26.330s   0m0.100s
611E40    0m23.135s   0m23.041s   0m0.068s
3D7E40    0m0.905s    0m0.900s    0m0.000s
138E40    0m38.311s   0m38.242s   0m0.044s
112E40    0m24.372s   0m24.314s   0m0.028s
270E40    0m34.142s   0m33.998s   0m0.092s
9ACE40    0m0.911s    0m0.908s    0m0.004s
C8DE40    0m0.898s    0m0.892s    0m0.000s
284E40    0m20.744s   0m20.621s   0m0.096s
3E0E40    0m0.910s    0m0.900s    0m0.004s
386E40    0m20.044s   0m19.921s   0m0.108s

Most of the time, the smaller the address, the more likely the run is to be
slow.  There are, however, some outliers.

What does this data mean?  It may mean nothing, but the run time does seem to
correlate strongly with address selection.  I can't think of any other reason
the code would behave so drastically differently from one run to the next.
Does this mean false pointers are the problem?  I'm not sure, but it's all I
can think of for now.



[Issue 5623] Slow GC with large heaps

2011-02-23 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


Steven Schveighoffer schvei...@yahoo.com changed:

   What|Removed |Added

 CC||schvei...@yahoo.com


--- Comment #11 from Steven Schveighoffer schvei...@yahoo.com 2011-02-23 
20:10:03 PST ---
From a cursory glance at the patch, it looks like it won't affect array
appending.

BTW, I had a very similar thought a while ago about storing a value that lets
you jump back to find the PAGEPLUS start, but I thought of a different method.

First, the Bins value is already stored for every page; it's an int, and we're
using exactly 13 of the 4 billion possible values.  My idea was to remove
B_PAGEPLUS from the enum: if the Bins value is anything other than the named
enum members, it encodes the number of pages to jump back, plus B_MAX.

This saves having to keep/update a separate array.

In addition, your statement that we only get 16 TB of space doesn't matter.  It
means the *jump size* is 16 TB.  That is, if you exceed 16 TB of space for a
block, then you just store the maximum.  The algorithm just has to be adjusted
to jump back that amount, then check the page at that location (which will also
know how far to jump back), and continue on.
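
A minimal sketch of that encoding, assuming hypothetical names (the real gcx.d
declarations differ):

enum Bins : uint { B_16, B_32, B_64, B_128, B_256, B_512,
                   B_1024, B_2048, B_PAGE, B_FREE, B_MAX }

// Largest jump-back value that still fits in the per-page uint.
enum uint maxOffset = uint.max - Bins.B_MAX;

// Interior pages of a large block store B_MAX + (pages back to the B_PAGE),
// reusing the existing per-page Bins array instead of a separate table.
void encodeInterior(uint[] bins, size_t firstPage, size_t nPages) {
    foreach (i; 1 .. nPages) {
        auto back = i > maxOffset ? maxOffset : cast(uint) i;
        bins[firstPage + i] = Bins.B_MAX + back;
    }
}

// Walk back to the block's B_PAGE entry; an oversized block just takes a few
// capped hops, each landing on a page that knows how much further to go.
size_t findBlockStart(const uint[] bins, size_t page) {
    while (bins[page] > Bins.B_MAX)
        page -= bins[page] - Bins.B_MAX;
    return page;
}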

Can you imagine how awesome the performance would be on a system with a 16TB
block with the linear search? ;)

I think this patch should be applied (will be voting shortly).



[Issue 5623] Slow GC with large heaps

2011-02-23 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


Steven Schveighoffer schvei...@yahoo.com changed:

   What|Removed |Added

 AssignedTo|nob...@puremagic.com|s...@invisibleduck.org


--- Comment #12 from Steven Schveighoffer schvei...@yahoo.com 2011-02-23 
20:12:09 PST ---
For some reason this was marked as assigned, but not to anyone.

I'm guessing you wanted it assigned to you, Sean, since you changed the bug to
Assigned?



[Issue 5623] Slow GC with large heaps

2011-02-23 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #13 from David Simcha dsim...@yahoo.com 2011-02-23 20:23:23 PST 
---
One logistical point:  Since I've rebuilt my mental model of how the GC works,
I've come up with a few other small ideas for optimization.  These don't have
nearly the impact that this patch does (they only improve things by a few
percent).  I think we should make this patch a high priority for the next
release, since its effect is huge.  For the smaller optimizations, I'll fork
the druntime git repository and commit them as I get around to testing and
benchmarking, and we can merge them back later.



[Issue 5623] Slow GC with large heaps

2011-02-23 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #14 from David Simcha dsim...@yahoo.com 2011-02-23 20:26:47 PST 
---
(In reply to comment #11)
 In addition, your statement that we only get 16 TB of space doesn't matter.
 It means the *jump size* is 16 TB.  That is, if you exceed 16 TB of space for
 a block, then you just store the maximum.  The algorithm just has to be
 adjusted to jump back that amount, then check the page at that location
 (which will also know how far to jump back), and continue on.

I'd rather just assume that, in the near future, no one is ever going to
allocate 16 TB in a single allocation, and that in the distant future we can
switch to size_t.  Hopefully we'll have a better GC by the time anyone has
enough RAM to allocate 16 TB in one allocation.



[Issue 5623] Slow GC with large heaps

2011-02-22 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


Sean Kelly s...@invisibleduck.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 CC||s...@invisibleduck.org


--- Comment #10 from Sean Kelly s...@invisibleduck.org 2011-02-22 15:01:51 
PST ---
I think the separation of pools for large and small allocations is a good
thing.  In fact, the current collector will return entirely free pools to the
OS at the end of a collection cycle, so the two are already logically separate.
 I can't think of a case where performance would be worse than before, but I'll
give the patch a once-over to be sure.



[Issue 5623] Slow GC with large heaps

2011-02-21 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #9 from David Simcha dsim...@yahoo.com 2011-02-21 16:12:54 PST ---
Here's another benchmark.  This one is designed to resemble reasonably common
scientific-computing / allocation-intensive use cases with a moderately large
overall heap, rather than to highlight the specific performance problem at
hand.

import std.random, core.memory, std.datetime, std.stdio;

enum nIter = 1000;

void main() {
    auto ptrs = new void*[1024];

    auto sw = StopWatch(AutoStart.yes);

    // Each iteration allocates 1024 large blocks with sizes uniformly
    // distributed between 1 and 128 kilobytes.
    foreach(i; 0..nIter) {
        foreach(ref ptr; ptrs) {
            ptr = GC.malloc(uniform(1024, 128 * 1024 + 1), GC.BlkAttr.NO_SCAN);
        }
    }

    writefln("Done %s iter in %s milliseconds.", nIter, sw.peek.msecs);
}

With patch:

Done 1000 iter in 7410 milliseconds.

Without patch:

Done 1000 iter in 38737 milliseconds.

Memory usage is about the same (based on looking at the task manager): about
88 MB with the patch, about 92 MB without.

To verify that this patch doesn't have deleterious effects when allocating lots
of objects close to the border between large and small, I also tried it
with uniform(1024, 8 * 1024 + 1) and got:

With patch:

Done 1000 iter in 2122 milliseconds.

Without patch:

Done 1000 iter in 4153 milliseconds.

The relatively small difference here isn't surprising, as there are no huge
allocations being done (max size is 8k).  Again, the difference in memory usage
was negligible (within epsilon of 12 MB for both).



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #1 from David Simcha dsim...@yahoo.com 2011-02-20 12:12:02 PST ---
Created an attachment (id=919)
The whole gcx.d file, in case you have trouble applying the patch.

Here's the whole gcx.d file, in case you have trouble applying the patch. 
(These things never quite work for me.)



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


Walter Bright bugzi...@digitalmars.com changed:

   What|Removed |Added

 CC||bugzi...@digitalmars.com


--- Comment #2 from Walter Bright bugzi...@digitalmars.com 2011-02-20 
13:03:57 PST ---
More explanation from David, from the newsgroup:

I've found a way to speed up the GC massively on large heaps without excessive
ripple effects.  Technically it's still O(N), but with about a hundredfold
smaller constant in the case of large heaps where most of the heap is not
scanned.  The O(N) term (where N is the total size of the heap) now has such a
small constant that, for almost all practical purposes, the GC is O(S) (where S
is the size of the scanned portion of the heap).  It also no longer has any
O(N^2) pathological case (which I had discovered while reading the code).

So far all unittests for Phobos, dstats and std.parallelism/parallelfuture pass
with this enabled.  Please test some other code so we can wring out the corner
cases in time for the next release.

Basically, all I did was split the Pool struct into large- and small-object
sub-varieties.  The large-object sub-variety is used to allocate objects of at
least a page.  It only stores gcbits at page-size offsets, and it tracks the
offsets of B_PAGEPLUS bins from the nearest B_PAGE bin so that the start of a
block can be found in O(1).

I also added a field to the Pool struct so that the number of free pages in a
pool can be tracked in O(1).  This should drastically lessen the time it takes
to perform large allocations on large heaps.  Right now, in the case of large
allocations, a free memory region is found by a linear search through the
pools.  Unfortunately, I don't see any easy way to fix that.  This patch at
least allows short-circuiting a large number of pools: if a pool doesn't have
enough free space in total, let alone contiguous space, it can be skipped
immediately.
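
A rough sketch of those two bookkeeping ideas (the names are hypothetical, not
the actual patch):

struct LargeObjectPool {
    uint[]  pageOffsets;  // pageOffsets[i] = pages back to the block's B_PAGE
    size_t  freePages;    // updated on every alloc/free, so reading it is O(1)

    // O(1): first page of the block that contains `page`.
    size_t blockStart(size_t page) const {
        return page - pageOffsets[page];
    }
}

// Short circuit used when allocating a large block: skip pools whose total
// free space is already too small, before doing any per-page search at all.
ptrdiff_t firstCandidatePool(const LargeObjectPool[] pools, size_t nPages) {
    foreach (i, pool; pools)
        if (pool.freePages >= nPages)
            return cast(ptrdiff_t) i;   // worth scanning this pool for a run
    return -1;                          // no pool can possibly satisfy it
}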



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


bearophile_h...@eml.cc changed:

   What|Removed |Added

 CC||bearophile_h...@eml.cc


--- Comment #3 from bearophile_h...@eml.cc 2011-02-20 13:19:15 PST ---
A very simple benchmark, in D and Java. You may use the D version with and
without your patch, and then the Java version too, and show the three timings,
with N = 15 or 18 or more:

http://codepad.org/yMGK34cb
http://codepad.org/DIOeYn6p



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #4 from David Simcha dsim...@yahoo.com 2011-02-20 14:48:45 PST ---
(In reply to comment #3)
 A very simple benchmark, in D and Java. You may use the D version with and
 without your patch, and then the Java version too, and show the three timings,
 with N = 15 or 18 or more:
 
 http://codepad.org/yMGK34cb
 http://codepad.org/DIOeYn6p

Using n = 18:

D (with patch):  62 seconds
D (without patch):  68 seconds
Java:  6 seconds

I'm not sure what this told us that we didn't already know, though.  We already
knew Java's GC is much better than D's.  My patch is meant to address issues
with large object allocation, not small object allocation.  (Large is defined as at least
one full page.  If no large objects are allocated for the duration of the
program, then my patch has virtually no effect.)  This benchmark does tons of
small object allocation and no large object allocation.  The small difference
between with and without patch may even just be noise.



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #5 from bearophile_h...@eml.cc 2011-02-20 15:07:17 PST ---
(In reply to comment #4)

 I'm not sure what this told us that we didn't already know, though.

Thank you for the timings.  This synthetic benchmark tells us something
important: for a different situation (but an important one, because the
benchmark is quite revealing despite being so simple), your patch doesn't make
the GC significantly worse, and it may even improve things a bit.



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #6 from bearophile_h...@eml.cc 2011-02-20 15:11:18 PST ---
(In reply to comment #4)

 We already knew Java's GC is much better than D's.

The purpose of a three-way comparison: the Java timing serves as a rough
baseline, allowing an estimate of the absolute difference between the other
two timings.



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623


David Simcha dsim...@yahoo.com changed:

   What|Removed |Added

 Attachment #918 is obsolete|0   |1


--- Comment #7 from David Simcha dsim...@yahoo.com 2011-02-20 18:29:08 PST ---
Created an attachment (id=920)
New patch.

Here's a new patch that adds one small but important optimization that I
forgot.  Now that we're storing the number of pages for B_PAGE stuff, we can
skip over large blocks in O(1) time when doing allocPages().
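
Roughly what that skip looks like (a sketch with made-up names, not the patch
itself): once each allocated B_PAGE entry also records how many pages its block
spans, the scan for a free run can hop over whole blocks in one step.

// blockPages[i] holds the block's page count at its first page (assumed >= 1);
// isFree[i] says whether page i is currently free.
size_t findFreeRun(const uint[] blockPages, const bool[] isFree, size_t nWanted) {
    size_t i = 0;
    while (i < isFree.length) {
        if (!isFree[i]) {
            i += blockPages[i];      // O(1) skip over the whole allocated block
            continue;
        }
        size_t run = 0;              // count contiguous free pages from i
        while (i + run < isFree.length && isFree[i + run] && run < nWanted)
            ++run;
        if (run >= nWanted)
            return i;                // found a big enough free run
        i += run;                    // jump past the too-small run and keep going
    }
    return size_t.max;               // no run of nWanted free pages exists
}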



[Issue 5623] Slow GC with large heaps

2011-02-20 Thread d-bugmail
http://d.puremagic.com/issues/show_bug.cgi?id=5623



--- Comment #8 from David Simcha dsim...@yahoo.com 2011-02-20 18:29:43 PST ---
Created an attachment (id=921)
New gcx.d
