Re: rt_finalize WTFs?

2011-12-09 Thread bearophile
dsimcha:

 I'm at my traditional pastime of trying to speed up D's garbage 
 collector again,

Are you going to create a pull request soon? So people will be able to try it 
with their D programs and spot potential troubles...

Bye,
bearophile


Re: rt_finalize WTFs?

2011-12-07 Thread Sean Kelly
On Dec 4, 2011, at 5:46 PM, dsimcha wrote:

 I'm at my traditional pastime of trying to speed up D's garbage collector 
 again, and I've stumbled on the fact that rt_finalize is taking up a 
 ridiculous share of the time (~30% of total runtime) on a benchmark where 
 huge numbers of classes **that don't have destructors** are being created and 
 collected.  Here's the code to this function, from lifetime.d:
 
 extern (C) void rt_finalize(void* p, bool det = true)
 {
   debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
 
if (p) // not necessary if called from gc
{
ClassInfo** pc = cast(ClassInfo**)p;
 
if (*pc)
{
ClassInfo c = **pc;
   byte[] w = c.init;
 
try
{
if (det || collectHandler is null || 
 collectHandler(cast(Object)p))
{
do
{
if (c.destructor)
{
fp_t fp = cast(fp_t)c.destructor;
(*fp)(cast(Object)p); // call destructor
}
c = c.base;
} while (c);
}
if ((cast(void**)p)[1]) // if monitor is not null
_d_monitordelete(cast(Object)p, det);
(cast(byte*) p)[0 .. w.length] = w[];  // WTF?
}
catch (Throwable e)
{
onFinalizeError(**pc, e);
}
finally  // WTF?
{
*pc = null; // zero vptr
}
}
}
 }
 
 Getting rid of the stuff I've marked with //WTF? comments (namely the finally 
 block and the re-initializing of the memory occupied by the finalized object) 
 speeds things up by ~15% on the benchmark in question.  Why do we care what 
 state the blob of memory is left in after we finalize it?  I can kind of see 
 that we want to clear things if delete/clear was called manually and we want 
 to leave the object in a state that doesn't look valid.  However, this has 
 significant performance costs and IIRC is already done in clear() and delete 
 is supposed to be deprecated.  Furthermore, I'd like to get rid of the 
 finally block entirely, since I assume its presence and the effect on the 
 generated code is causing the slowdown, not the body, which just assigns a 
 pointer.

Now that having multiple in-flight exceptions is actually legal, we can 
probably just throw out the try/catch entirely.  The try/catch and call to 
onFinalizeError is a holdover from when it was effectively illegal to throw 
from a finalizer.

Re: rt_finalize WTFs?

2011-12-06 Thread Sean Kelly
I have an updated and win32-compilable version of CDGC in a branch. It's 
missing some features from the current GC though (it's based on the Tango GC 
which has remained relatively static for the past few years while druntime's GC has 
improved).

Sent from my iPhone

On Dec 5, 2011, at 3:39 PM, Trass3r u...@known.com wrote:

 On 05/12/2011 01:46, dsimcha wrote:
 I'm at my traditional pastime of trying to speed up D's garbage
 collector again
 
 Have you thought about pushing for the inclusion of CDGC at all/working on 
 the tweaks needed to make it the main GC?
 
 So true, it's been rotting in that branch.


Re: rt_finalize WTFs?

2011-12-05 Thread Rainer Schuetze
Last time I looked at it, the try/catch/finally block was rather 
expensive because it always invokes the exception handler unwinding 
mechanism, even if no exception occurs.
Try moving the try/catch block out of the loops that call rt_finalize. 
(maybe just remove it, onFinalizeError just rethrows, and it seems noone 
is using the information that the exception is thrown from inside the 
finalizer)
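To make the suggestion concrete, here is a minimal C++ sketch of the difference between a per-object handler and one hoisted out of the sweep loop (the `finalize_one` helper and the error handling are invented for illustration; the real runtime code is D):

```cpp
#include <stdexcept>
#include <vector>

static int finalized = 0;

// Stand-in for running one object's destructor chain; throwing models a
// finalizer error. This helper is invented for the sketch.
void finalize_one(bool fail) {
    if (fail) throw std::runtime_error("finalizer threw");
    ++finalized;
}

// Today's shape: an exception frame is set up for every single object,
// even when no destructor ever throws.
void sweep_per_object(const std::vector<bool>& objs) {
    for (bool fail : objs) {
        try { finalize_one(fail); }
        catch (const std::exception&) { /* onFinalizeError */ }
    }
}

// Rainer's suggestion: one frame around the whole sweep. Note the changed
// semantics: a throwing finalizer now skips the remaining finalizers,
// which is acceptable if finalizer errors are treated as unrecoverable.
void sweep_hoisted(const std::vector<bool>& objs) {
    try {
        for (bool fail : objs) finalize_one(fail);
    } catch (const std::exception&) { /* onFinalizeError */ }
}
```

The observable behavior differs only when a finalizer actually throws, which is exactly the case Vladimir flags as unrecoverable anyway.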


On 05.12.2011 02:46, dsimcha wrote:

[dsimcha's original message quoted in full; snipped]


Re: rt_finalize WTFs?

2011-12-05 Thread Vladimir Panteleev
On Mon, 05 Dec 2011 10:14:00 +0200, Rainer Schuetze r.sagita...@gmx.de  
wrote:



Try moving the try/catch block out of the loops that call rt_finalize.


This sounds like a good idea. Just make sure that all code paths that lead  
to rt_finalize (e.g. delete) can recover from an exception.


(maybe just remove it, onFinalizeError just rethrows, and it seems noone  
is using the information that the exception is thrown from inside the  
finalizer)


The point of onFinalizeError is that it throws an *error* (it is  
unrecoverable). Currently, the GC cannot recover from an exception thrown  
by a finalizer, which is why onFinalizeError exists.


--
Best regards,
 Vladimir <vladi...@thecybershadow.net>


Re: rt_finalize WTFs?

2011-12-05 Thread Steven Schveighoffer

On Sun, 04 Dec 2011 20:46:27 -0500, dsimcha dsim...@yahoo.com wrote:

[dsimcha's original message quoted in full; snipped]


I think it might be a good idea to move those two operations to the clear  
function.  Currently, clear(obj) simply calls rt_finalize(obj).


It does seem rather strange to have that finally there, why do we care on  
an exception/error that we zero the vptr?  I'd at least put that code up  
where the first WTF is.


But I think we do need to have the reinitialization of the object and  
zeroing of vptr for clear, since it's part of clear's charter.


I'd support any effort to speed up the GC, it's definitely the worst  
offender for performance.  Looks like there's quite a few good ideas in  
this thread.


-Steve


Re: rt_finalize WTFs?

2011-12-05 Thread Martin Nowak
On Mon, 05 Dec 2011 09:14:00 +0100, Rainer Schuetze r.sagita...@gmx.de  
wrote:


Last time I looked at it, the try/catch/finally block was rather  
expensive because it always invokes the exception handler unwinding  
mechanism, even if no exception occurs.
Try moving the try/catch block out of the loops that call rt_finalize.  
(maybe just remove it, onFinalizeError just rethrows, and it seems noone  
is using the information that the exception is thrown from inside the  
finalizer)



Just an unconditional jump into the finally body here, but
it still affects register assignment.
Moving the exception handler out to the sweep scope seems promising;
it could save a lot of register saves.

I appreciate the recursion during mark, wanted to do this myself
some time ago but expected a little more gain.

Some more ideas:

 - Do a major refactoring of the GC code, making it less resistant
   to change. Adding sanity checks or unit tests would be great.
   This probably reveals some obfuscated performance issues.

 - Add more realistic GC benchmarks; this just requires adding to
   druntime/test/gcbench using the new runbench. The tree1 benchmark mainly
   uses homogeneous classes, so it is very synthetic.

 - There is one binary search pool lookup for every scanned address in  
range.

   Should be a lot to gain here, but it's difficult. It needs a multilevel
   mixture of bitset/hashtab.

 - Reduce the GC roots range. I will have to work on this for
   shared library support anyhow.

martin


On 05.12.2011 02:46, dsimcha wrote:

[dsimcha's original message quoted in full; snipped]


Re: rt_finalize WTFs?

2011-12-05 Thread dsimcha
== Quote from Martin Nowak (d...@dawgfoto.de)'s article
 I appreciate the recursion during mark, wanted to do this myself
 sometime ago but expected a little more gain.

The reason the gain wasn't huge is because on the benchmark I have that 
involves a
deep heap graph, sweeping time dominates marking time.  The performance gain for
the mark phase only (which is important b/c this is when the world needs to be
stopped) is ~20-30%.

 Some more ideas:
   - Do a major refactoring of the GC code, making it less reluctant
 to changes. Adding sanity checks or unit tests would be great.
 This probably reveals some obfuscated performance issues.

Not just obfuscated ones.  I've wanted to fix an obvious perf bug for two years
and haven't done it because the necessary refactoring would be unbelievably 
messy
and I'm too afraid I'll break something.  Basically, malloc() sets the bytes
between the size you requested and the size of the block actually allocated to
zero to prevent false pointers.  This is reasonable.  The problem is that it 
does
so **while holding the GC's lock**.  Fixing it for just the case when malloc() 
is
called by the user is also easy.  The problem is fixing it when malloc() gets
called from realloc(), calloc(), etc.
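A hedged C++ sketch of the lock-scope problem described above (the names `gc_lock`, `Block`, and both allocation helpers are invented; a real fix would also have to ensure the block is not visible to a concurrent scan before its tail is zeroed):

```cpp
#include <cstring>
#include <mutex>
#include <vector>

static std::mutex gc_lock;  // stand-in for the GC's global lock

struct Block {
    std::vector<unsigned char> mem;  // the allocated block
    std::size_t requested;           // bytes the caller asked for
};

// Today's shape: the slack between requested size and block size is
// zeroed (to kill false pointers) while the global lock is held.
Block alloc_zero_inside_lock(std::size_t requested, std::size_t block_size) {
    std::lock_guard<std::mutex> g(gc_lock);
    Block b{std::vector<unsigned char>(block_size, 0xAA), requested};
    std::memset(b.mem.data() + requested, 0, block_size - requested);
    return b;
}

// The fix dsimcha describes: hold the lock only to grab the block, then
// zero the slack outside it so other threads can allocate concurrently.
Block alloc_zero_outside_lock(std::size_t requested, std::size_t block_size) {
    Block b;
    {
        std::lock_guard<std::mutex> g(gc_lock);
        b = Block{std::vector<unsigned char>(block_size, 0xAA), requested};
    }
    std::memset(b.mem.data() + requested, 0, block_size - requested);
    return b;
}
```

The messy part he alludes to is that realloc()/calloc() reach the same code path already holding the lock, so the memset cannot simply be moved without restructuring those callers.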

   - Add more realistic GC benchmarks, just requires adding to
 druntime/test/gcbench using the new runbench. The tree1 mainly
 uses homogeneous classes, so this is very synthesized.

I'll crowdsource this.  I can't think of any good benchmarks that are <= a few
hundred lines w/ no dependencies but aren't pretty synthetic.

   - There is one binary search pool lookup for every scanned address in
 range.
 Should be a lot to gain here, but it's difficult. It needs a multilevel
 mixture of bitset/hashtab.

I understand the problem, but please elaborate on the proposed solution.  You've
basically got a bunch of pools, each of which represents a range of memory
addresses, not a single address (so a basic hashtable is out).  You need to know
which range some pointer fits in.  How would you beat binary search/O(log N) 
for this?

   - Reduce the GC roots range. I will have to work on this for
 shared library support anyhow.

Please clarify what you mean by reduce the roots range.

Thanks for the feedback/suggestions.


Re: rt_finalize WTFs?

2011-12-05 Thread Robert Clipsham

On 05/12/2011 01:46, dsimcha wrote:

I'm at my traditional pastime of trying to speed up D's garbage
collector again


Have you thought about pushing for the inclusion of CDGC at all/working 
on the tweaks needed to make it the main GC?


--
Robert
http://octarineparrot.com/


Re: rt_finalize WTFs?

2011-12-05 Thread Martin Nowak

On Mon, 05 Dec 2011 22:07:09 +0100, dsimcha dsim...@yahoo.com wrote:


== Quote from Martin Nowak (d...@dawgfoto.de)'s article

I appreciate the recursion during mark, wanted to do this myself
sometime ago but expected a little more gain.


The reason the gain wasn't huge is because on the benchmark I have that  
involves a
deep heap graph, sweeping time dominates marking time.  The performance  
gain for
the mark phase only (which is important b/c this is when the world needs  
to be

stopped) is ~20-30%.


Some more ideas:
  - Do a major refactoring of the GC code, making it less reluctant
to changes. Adding sanity checks or unit tests would be great.
This probably reveals some obfuscated performance issues.


Not just obfuscated ones.  I've wanted to fix an obvious perf bug for  
two years
and haven't done it because the necessary refactoring would be  
unbelievably messy
and I'm too afraid I'll break something.  Basically, malloc() sets the  
bytes
between the size you requested and the size of the block actually  
allocated to
zero to prevent false pointers.  This is reasonable.  The problem is  
that it does
so **while holding the GC's lock**.  Fixing it for just the case when  
malloc() is
called by the user is also easy.  The problem is fixing it when malloc()  
gets

called from realloc(), calloc(), etc.


  - Add more realistic GC benchmarks, just requires adding to
druntime/test/gcbench using the new runbench. The tree1 mainly
uses homogeneous classes, so this is very synthesized.


I'll crowdsource this.  I can't think of any good benchmarks that are   
a few

hundred lines w/ no dependencies but aren't pretty synthetic.


Probably write a tool that can replay recorded allocation profiles.
But it will be difficult to fake internal pointers if they are not  
recorded.
If we were able to enable recording in the runtime, people having  
performance
issues with the GC could then send a file that can be used for  
optimizations.



  - There is one binary search pool lookup for every scanned address in
range.
Should be a lot to gain here, but it's difficult. It needs a  
multilevel

mixture of bitset/hashtab.


I understand the problem, but please elaborate on the proposed  
solution.  You've
basically got a bunch of pools, each of which represents a range of  
memory
addresses, not a single address (so a basic hashtable is out).  You need  
to know
which range some pointer fits in.  How would you beat binary  
search/O(log N) for this?



It is possible to use a hashtable or a bitset because each pool occupies
at least PAGESIZE.
One would need to use addresses relative to the minimum address
but that still needs 256K entries per GByte, so the maintenance
overhead is likely too big.

More promising is to put pool address ranges in a trie.

addr[7]  [...  . ...]
 / |\
addr[6] [...   .   ...][...   .   ...]
   /   |\ /   |   \
addr[5] pool:8 [...   .  ...]
  /   |   \
addr[4]  pool:8 [] pool:5

Lookup is bounded constant time, with at most 8 (4) memory accesses.
Maintenance is only required when adding/removing pools, though it is a
little sophisticated.


-

alias Node Trie;

struct Node
{
   union
   {
  Type   _type;
  Pool*  _pool;
  Node[16]*   _nodes16; // uses upper 3 bits per level
  ...
  Node[256]* _nodes256; // uses  full 8 bits per level
   }

   // use lower 3 bits to encode type, given guaranteed 8-byte alignment  
of malloc

   enum NTypeBits = 3;
   enum TypeMask = (1 << NTypeBits) - 1;
   enum PtrMask = ~TypeMask;

   enum Type { Empty, Pool, Nodes16, Nodes256 };
   static assert(Type.max <= TypeMask);

   Pool* getPool(void* p)
   {
   return getPoolImpl((cast(ubyte*)p) + size_t.sizeof - 1);
   }

   Pool* getPoolImpl(ubyte* p)
   {
   Node* node = void;

   final switch (type)
   {
   case Type.Empty:
  return null;
   case Type.Pool:
  return pool;
   case Type.Nodes16:
  node = nodes16[*p >> 5];
  break;
   case Type.Nodes256:
  node = nodes256[*p];
  break;
   }

   if (node.type == Type.Pool)
   return node.pool;
   else
   return node.getPoolImpl(p - 1);
   }

   void insertPool(Pool* npool, uint level=0)
   {
   final switch (type)
   {
   case Type.Empty:
   pool = npool;

   case Type.Pool:
   Pool *opool = pool;
   if (opool != npool)
   {
   nodes16 = new Node[16];

   foreach (i; 0 .. 16)
   {
   // non-trivial code here
   // Needs to figure out into which subnode opool and  
npool
   // should be inserted. They can use pool.minAddr and  
pool.maxAddr


   nodes16[i].insertPool(opool, level + 1);
   

Re: rt_finalize WTFs?

2011-12-05 Thread Vladimir Panteleev

On Mon, 05 Dec 2011 23:07:09 +0200, dsimcha dsim...@yahoo.com wrote:

I understand the problem, but please elaborate on the proposed  
solution.  You've
basically got a bunch of pools, each of which represents a range of  
memory
addresses, not a single address (so a basic hashtable is out).  You need  
to know
which range some pointer fits in.  How would you beat binary  
search/O(log N) for this?


A tree, with a few bits of the address space per level. Lookup becomes
bounded by the size of the address space, not the number of pools.


--
Best regards,
 Vladimir <vladi...@thecybershadow.net>


Re: rt_finalize WTFs?

2011-12-05 Thread Iain Buclaw
On 5 December 2011 22:34, Martin Nowak d...@dawgfoto.de wrote:
 We currently add something from etext to _end (rt.memory) as a static root.
 This probably contains read-only sections, data from other languages
 and (unlikely) other unrelated sections.
 I think some help from the compiler will be necessary to support
 shared libraries anyhow, so we should be able to get a precise range.


Indeed.  The current implementation kicking around is to scan
/proc/self/maps on init (this is true for the runtime that ships with
GDC at least).  Which is not the most pleasant of things to do - not
to mention only supports Unix-flavoured systems, but it works well
enough for when linking against shared D libraries.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: rt_finalize WTFs?

2011-12-05 Thread Martin Nowak

More promising is to put pool address ranges in a trie.

addr[7]  [...  . ...]
  / |\
addr[6] [...   .   ...][...   .   ...]
/   |\ /   |   \
addr[5] pool:8 [...   .  ...]
   /   |   \
addr[4]  pool:8 [] pool:5


Actually 64-bit should use a hashtable for the upper 32 bits and then
the 32-bit trie for the lower 32 bits.


Re: rt_finalize WTFs?

2011-12-05 Thread dsimcha
== Quote from Martin Nowak (d...@dawgfoto.de)'s article
  More promising is to put pool addresses ranges in a trie.
 
  addr[7]  [...  . ...]
/ |\
  addr[6] [...   .   ...][...   .   ...]
  /   |\ /   |   \
  addr[5] pool:8 [...   .  ...]
 /   |   \
  addr[4]  pool:8 [] pool:5
 
 Actually 64-bit should use a hashtable for the upper 32-bit and then
 the 32-bit trie for lower.

Why do you expect this to be faster than a binary search?  I'm not saying it 
won't
be, just that it's not a home run that deserves a high priority as an
optimization.  You still have a whole bunch of indirections, probably more than
you would ever have for binary search.


Re: rt_finalize WTFs?

2011-12-05 Thread Trass3r

On 05/12/2011 01:46, dsimcha wrote:

I'm at my traditional pastime of trying to speed up D's garbage
collector again


Have you thought about pushing for the inclusion of CDGC at all/working  
on the tweaks needed to make it the main GC?


So true, it's been rotting in that branch.


Re: rt_finalize WTFs?

2011-12-05 Thread dsimcha

On 12/5/2011 6:39 PM, Trass3r wrote:

On 05/12/2011 01:46, dsimcha wrote:

I'm at my traditional pastime of trying to speed up D's garbage
collector again


Have you thought about pushing for the inclusion of CDGC at
all/working on the tweaks needed to make it the main GC?


So true, it's been rotting in that branch.


IIRC CDGC includes two major enhancements:

1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything 
Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC 
is really that great an idea given its propensity to waste memory on 
long collect cycles with a lot of mutation.)


2.  I think there was some precise heap scanning-related stuff in it.  I 
originally tried to implement precise heap scanning a couple years ago, 
but it went nowhere for reasons too complicated to explain here.  Given 
this experience, I'm not inclined to try again until the compiler has 
extensions for generating pointer offset information.


Re: rt_finalize WTFs?

2011-12-05 Thread Martin Nowak

On Tue, 06 Dec 2011 00:16:01 +0100, dsimcha dsim...@yahoo.com wrote:


== Quote from Martin Nowak (d...@dawgfoto.de)'s article

 More promising is to put pool addresses ranges in a trie.

 addr[7]  [...  . ...]
   / |\
 addr[6] [...   .   ...][...   .   ...]
 /   |\ /   |   \
 addr[5] pool:8 [...   .  ...]
/   |   \
 addr[4]  pool:8 [] pool:5

Actually 64-bit should use a hashtable for the upper 32-bit and then
the 32-bit trie for lower.


Why do you expect this to be faster than a binary search?  I'm not  
saying it won't

be, just that it's not a home run that deserves a high priority as an
optimization.  You still have a whole bunch of indirections, probably  
more than

you would ever have for binary search.


It's the tightest loop in the garbage collector.
It will end with a few array accesses, and the locality of the
memory being pointed to is mirrored by tree locality,
so you'd have a good chance of finding the nodes in the cache.

What this gives you is that it scales very well to many pools/regions.
Thus you can better tweak the pool granularity against pool sizes.

Also:
for (p  pend)
  if (*p in lastPool)
can be:
for (p  pend)
  if (*p in lastNode)

It is definitely no home run, but I'll try it
out when I find some time.
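The trade-off being debated can be made concrete with a small C++ sketch (the `Pool` layout and `PAGESIZE = 4096` are assumptions; a flat page-index table is the degenerate one-level form of the trie, and its size is exactly the "256K entries per GByte" objection raised earlier in the thread):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Pool { std::uintptr_t minAddr, maxAddr; };  // half-open [min, max)

// Today's lookup: binary search over pools sorted by address, O(log N)
// per scanned pointer.
const Pool* find_binary(const std::vector<Pool>& pools, std::uintptr_t p) {
    auto it = std::upper_bound(pools.begin(), pools.end(), p,
        [](std::uintptr_t v, const Pool& pl) { return v < pl.minAddr; });
    if (it == pools.begin()) return nullptr;
    --it;
    return p < it->maxAddr ? &*it : nullptr;
}

// Proposed direction: because every pool spans whole pages, a table
// indexed by page number (relative to the lowest pool address) answers
// in O(1), independent of the number of pools.
struct PageTable {
    static constexpr std::uintptr_t PAGESIZE = 4096;
    std::uintptr_t base;
    std::vector<const Pool*> pages;  // one entry per page

    // Assumes `pools` is non-empty and sorted by address.
    explicit PageTable(const std::vector<Pool>& pools) {
        base = pools.front().minAddr;
        std::uintptr_t top = pools.back().maxAddr;
        pages.assign((top - base) / PAGESIZE, nullptr);
        for (const Pool& pl : pools)
            for (std::uintptr_t a = pl.minAddr; a < pl.maxAddr; a += PAGESIZE)
                pages[(a - base) / PAGESIZE] = &pl;
    }
    const Pool* find(std::uintptr_t p) const {
        if (p < base) return nullptr;
        std::uintptr_t idx = (p - base) / PAGESIZE;
        return idx < pages.size() ? pages[idx] : nullptr;
    }
};
```

The multi-level trie Martin describes trades the flat table's memory overhead for a few extra indirections while keeping the bounded lookup.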


Re: rt_finalize WTFs?

2011-12-05 Thread Trass3r

IIRC CDGC includes two major enhancements:

1.  The snapshot GC for Linux.  (Does this work on OSX/FreeBSD/anything  
Posix, or just Linux?  I'm a bit skeptical about whether a snapshot GC  
is really that great an idea given its propensity to waste memory on  
long collect cycles with a lot of mutation.)


I guess, at least it uses fork IIRC.


2.  I think there was some precise heap scanning-related stuff in it.  I  
originally tried to implement precise heap scanning a couple years ago,  
but it went nowhere for reasons too complicated to explain here.  Given  
this experience, I'm not inclined to try again until the compiler has  
extensions for generating pointer offset information.


What's the status of that anyway? Every patch in that bugzilla entry is  
marked obsolete and there's no real final answer.


Re: rt_finalize WTFs?

2011-12-05 Thread Jacob Carlborg

On 2011-12-05 23:45, Iain Buclaw wrote:

On 5 December 2011 22:34, Martin Nowakd...@dawgfoto.de  wrote:

We currently add something from etext to _end (rt.memory) as static root.
This probably contains read-only sections, data from other languages
and (unlikely) other unrelated sections.
I think some help from the compiler will be necessary to support
shared libraries anyhow, so we should be able to get a precise range.



Indeed.  The current implementation kicking around is to scan
/proc/self/maps on init (this is true for the runtime that ships with
GDC at least).  Which is not the most pleasant of things to do - not
to mention only supports Unix-flavoured systems, but it works well
enough for when linking against shared D libraries.


/proc doesn't exist on Mac OS X.

--
/Jacob Carlborg


rt_finalize WTFs?

2011-12-04 Thread dsimcha
I'm at my traditional pastime of trying to speed up D's garbage 
collector again, and I've stumbled on the fact that rt_finalize is 
taking up a ridiculous share of the time (~30% of total runtime) on a 
benchmark where huge numbers of classes **that don't have destructors** 
are being created and collected.  Here's the code to this function, from 
lifetime.d:


extern (C) void rt_finalize(void* p, bool det = true)
{
    debug(PRINTF) printf("rt_finalize(p = %p)\n", p);

if (p) // not necessary if called from gc
{
ClassInfo** pc = cast(ClassInfo**)p;

if (*pc)
{
ClassInfo c = **pc;
    byte[] w = c.init;

try
{
if (det || collectHandler is null || 
collectHandler(cast(Object)p))

{
do
{
if (c.destructor)
{
fp_t fp = cast(fp_t)c.destructor;
(*fp)(cast(Object)p); // call destructor
}
c = c.base;
} while (c);
}
if ((cast(void**)p)[1]) // if monitor is not null
_d_monitordelete(cast(Object)p, det);
(cast(byte*) p)[0 .. w.length] = w[];  // WTF?
}
catch (Throwable e)
{
onFinalizeError(**pc, e);
}
finally  // WTF?
{
*pc = null; // zero vptr
}
}
}
}

Getting rid of the stuff I've marked with //WTF? comments (namely the 
finally block and the re-initializing of the memory occupied by the 
finalized object) speeds things up by ~15% on the benchmark in question. 
 Why do we care what state the blob of memory is left in after we 
finalize it?  I can kind of see that we want to clear things if 
delete/clear was called manually and we want to leave the object in a 
state that doesn't look valid.  However, this has significant 
performance costs and IIRC is already done in clear() and delete is 
supposed to be deprecated.  Furthermore, I'd like to get rid of the 
finally block entirely, since I assume its presence and the effect on 
the generated code is causing the slowdown, not the body, which just 
assigns a pointer.


Is there any good reason to keep this code around?


Re: rt_finalize WTFs?

2011-12-04 Thread Martin Nowak

On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha dsim...@yahoo.com wrote:

[dsimcha's original message quoted in full; snipped]


Not for the try block. With errors being unrecoverable you don't need to
care about zeroing the vtbl, or you could just copy the code into the catch
handler.

This seems to cause fewer spilled variables.

Most expensive is the call to a memcpy@PLT, replace it with something  
inlineable.

Zeroing is not much faster than copying init[] for small classes.

At least zeroing should be worth it, unless the GC would not scan the  
memory otherwise.
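An illustrative C++ sketch of the reset strategies being compared (the function names and sizes are invented; in druntime the init image is `ClassInfo.init` and the PLT call is the dynamic-linker indirection Martin mentions):

```cpp
#include <cstddef>
#include <cstring>

// Copy the class init image back over the object, as rt_finalize does
// today. When not inlined this goes through memcpy@PLT.
void reset_with_init(unsigned char* obj, const unsigned char* init,
                     std::size_t n) {
    std::memcpy(obj, init, n);
}

// Just zero the memory: enough to kill false pointers if the GC would
// otherwise still scan the block, and no init image needs loading.
void reset_with_zero(unsigned char* obj, std::size_t n) {
    std::memset(obj, 0, n);
}

// An inlineable word-wise copy for small, word-aligned objects, avoiding
// the out-of-line library call entirely.
void reset_small(std::size_t* obj, const std::size_t* init,
                 std::size_t nwords) {
    for (std::size_t i = 0; i < nwords; ++i)
        obj[i] = init[i];
}
```

For small classes all three touch the same handful of cache lines, which is why zeroing alone buys little; the win is skipping the call overhead or skipping the reset entirely when the GC will not scan the block again.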


gcbench/tree1 = 41.8s => https://gist.github.com/1432117 => gcbench/tree1
= 33.4s


Please add useful benchmarks to druntime.

martin


Re: rt_finalize WTFs?

2011-12-04 Thread bearophile
dsimcha:

 I've stumbled on the fact that rt_finalize is 
 taking up a ridiculous share of the time (~30% of total runtime) on a 
 benchmark where huge numbers of classes **that don't have destructors** 
 are being created and collected.

DMD or LDC/GDC compiler?


 extern (C) void rt_finalize(void* p, bool det = true)
 {
  debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
 
  if (p) // not necessary if called from gc
  {

That if(p) seems fit to become a static if on bool template argument (but it 
needs to become D code or two C functions):

void rt_finalize(bool byGC)(void* p, bool det = true) { ...

Bye,
bearophile
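bearophile's suggestion, rendered as a hedged C++ sketch (the names `finalize` and `destructors_run` are invented; the real change would be a D template `rt_finalize(bool byGC)(void* p, ...)` or two C functions, as he notes):

```cpp
static int destructors_run = 0;

// The "called from GC" distinction becomes a compile-time parameter, so
// the null check vanishes from the GC's hot path instead of being a
// runtime branch plus comment.
template <bool FromGC>
void finalize(void* p) {
    // FromGC is a compile-time constant, so the compiler folds this test
    // away entirely; D would express the same thing with `static if`.
    if (!FromGC) {
        if (p == nullptr) return;  // only direct callers may pass null
    }
    // ... destructor chain would run here ...
    ++destructors_run;
}
```

Callers inside the sweep would instantiate `finalize<true>`, while the user-facing entry point instantiates `finalize<false>` and keeps the null check.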


Re: rt_finalize WTFs?

2011-12-04 Thread dsimcha
Thanks for the benchmark.  I ended up deciding to just create a second 
function, rt_finalize_gc, that gets rid of a whole bunch of cruft that 
isn't necessary in the GC case.  I think it's worth the small amount of 
code duplication it creates.  Here are the results of my efforts so far: 
 https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . 
I've got one other good idea that I think will shave a few seconds off 
the Tree1 benchmark if I don't run into any unforeseen obstacles in 
implementing it.


On 12/4/2011 10:07 PM, Martin Nowak wrote:

[Martin's reply, quoting the original message in full; snipped]