12-Jun-2014 10:34, Rainer Schuetze wrote:


On 11.06.2014 18:59, Dmitry Olshansky wrote:
03-Jun-2014 11:35, Rainer Schuetze wrote:
Hi,

more GC talk: the last couple of days, I've been experimenting with
implementing a concurrent GC on Windows inspired by Leandro's CDGC.
Here's a report on my experiments:

http://rainers.github.io/visuald/druntime/concurrentgc.html

[snip]

See the sketch of the idea here:
https://gist.github.com/DmitryOlshansky/5e32057e047425480f0e


Cool stuff! I remember trying something similar, but IIRC forcing the
same address with MapViewOfFile somehow failed (maybe this was across
processes). I tried your version on both Win32 and Win64 successfully,
though.



I implemented the QueryWorkingSetEx version like this (you need a
converted psapi.lib for Win32):

Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.


enum PAGES = 512; //SIZE / 4096;
PSAPI_WORKING_SET_EX_INFORMATION[PAGES] info;
foreach(i, ref inf; info)
     inf.VirtualAddress = heap + i * 4096;
if (!QueryWorkingSetEx(GetCurrentProcess(), info.ptr, info.sizeof))
     throw new Exception(format("Could not query info (%d).\n",
                                GetLastError()));

foreach(i, ref inf; info)
     writefln("flags page %d: %x", i, inf.VirtualAttributes);


and you can check the "shared" field to get copied pages.
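
For illustration, that check could look roughly like the sketch below. It assumes the bit layout of PSAPI_WORKING_SET_EX_BLOCK as given in Microsoft's documentation (Valid in bit 0, Shared in bit 15) and takes the raw flags word that the loop above prints. The helper names isValid and isCopied are made up for this example.

```d
// Sketch only: decode the raw flags word of one PSAPI_WORKING_SET_EX_BLOCK.
// Assumed layout (per the Windows SDK docs): bit 0 = Valid, bit 15 = Shared.
enum size_t VALID_BIT  = 1 << 0;
enum size_t SHARED_BIT = 1 << 15;

// Hypothetical helper: the page is resident in the working set.
bool isValid(size_t flags) pure nothrow @nogc @safe
{
    return (flags & VALID_BIT) != 0;
}

// Hypothetical helper: a resident page of the CoW view that is no longer
// shared has been written to and therefore got its own private copy.
bool isCopied(size_t flags) pure nothrow @nogc @safe
{
    return isValid(flags) && (flags & SHARED_BIT) == 0;
}

unittest
{
    assert(!isCopied(0));                      // not resident
    assert(!isCopied(VALID_BIT | SHARED_BIT)); // still shared with the original
    assert(isCopied(VALID_BIT));               // private copy after a write
}
```

Pages that are not resident are reported as not copied here, since the other bits carry no meaning when Valid is clear.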


This function
is not supported on XP, though.

I wouldn't worry about it; it's not like XP users are growing in number. Also, it looks like only the 64-bit version is good to go, as on 32-bit it would cut usable memory in half.

A short benchmark shows that VirtualQuery needs 55/42 ms for your test
on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17
ms for both.

Seems in line with my measurements. Strictly speaking, modifying half of the pages, interleaved, should give an estimate of the worst case. Together with remapping (freeing duplicated pages), it doesn't go beyond 250 ms on 640 MB of heap.

If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
copied), I get 80-90 ms more.

Aye... this is a lot. Also, for me it turns out that unmapping the CoW view at the last step takes most of the time. It might help to split the full heap into multiple views.

Also, using VirtualProtect during the first step (turning a mapping into a CoW one) is faster than unmap/map (by a factor of 2).

One thing that may help is saving a pointer to the end of the used heap at the moment of the scan, then remapping only this portion as CoW.

The last issue I see is adjustment of pointers: in a GC, the mapped view is mapped at a new address, so pointers would need to be fixed up during scanning.
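
The fixup itself is only address arithmetic. A minimal sketch, assuming the snapshot view covers the heap starting from its base and that the scanner knows both base addresses and the heap size (toView and its parameter names are hypothetical, not taken from the gist or from druntime):

```d
// Sketch only: translate a pointer into the live heap to the matching
// address inside the snapshot view that is mapped at a different base.
// Returns null for pointers outside [heapBase, heapBase + heapSize).
const(void)* toView(const(void)* p, const(void)* heapBase,
                    const(void)* viewBase, size_t heapSize) pure nothrow @nogc
{
    const addr = cast(size_t) p;
    const base = cast(size_t) heapBase;
    if (addr < base || addr - base >= heapSize)
        return null; // not a heap pointer, nothing to follow
    return cast(const(void)*) (cast(size_t) viewBase + (addr - base));
}

unittest
{
    auto heap = cast(const(void)*) 0x1000_0000;
    auto view = cast(const(void)*) 0x5000_0000;
    // A pointer 0x40 bytes into the heap maps 0x40 bytes into the view.
    assert(toView(cast(const(void)*) 0x1000_0040, heap, view, 0x1000)
           is cast(const(void)*) 0x5000_0040);
    // Pointers outside the heap range are rejected.
    assert(toView(cast(const(void)*) 0x2000_0000, heap, view, 0x1000) is null);
}
```

Every pointer read from the snapshot would have to go through such a translation before it is dereferenced, which adds a range check and an addition per candidate pointer.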


The numbers are not great, but I guess the usual memory usage and number
of modified pages will be much lower. I'll see if I can integrate this
into the concurrent implementation.

Wish you luck, I'm still not sure if it will help.

--
Dmitry Olshansky
