Sorry to keep pestering, but just pinging about this patch again, as I still
think this fix could benefit Windows users. And at this point, I think I can
say we have tested it pretty well, having run it on our servers for almost a
year :).
Thanks,
Kris

On Wed, Sep 18, 2019 at 12:56 PM Kris Zyp <[email protected]> wrote:
> Checking on this again: is this still a possibility for merging into LMDB?
> This fix is still working great (improved performance) on our systems.
> Thanks,
> Kris
>
> On Mon, Jun 17, 2019 at 1:04 PM Kris Zyp <[email protected]> wrote:
>
>> Is this still being considered/reviewed? Let me know if there are any
>> other changes you would like me to make. This patch has continued to
>> yield significant and reliable performance improvements for us, and it
>> would be nice for this to be available to other Windows users.
>>
>> On Fri, May 3, 2019 at 3:52 PM Kris Zyp <[email protected]> wrote:
>>
>>> For the sake of putting this in the email thread (other code discussion
>>> is in GitHub), here is the latest squashed commit of the proposed patch
>>> (with the on-demand, retained overlapped array to reduce re-mallocs and
>>> the opening of event handles):
>>> https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
>>>
>>> Thanks,
>>> Kris
>>>
>>> From: Kris Zyp <[email protected]>
>>> Sent: April 30, 2019 12:43 PM
>>> To: Howard Chu <[email protected]>; [email protected]
>>> Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
>>>
>>> > What is the point of using writemap mode if you still need to use
>>> > WriteFile on every individual page?
>>>
>>> As I understood from the documentation, and have observed, using
>>> writemap mode is faster (and uses less temporary memory) because it
>>> doesn't require mallocs to allocate pages (docs: "This is faster and
>>> uses fewer mallocs"). To be clear, though, LMDB is so incredibly fast
>>> and efficient that, in sync mode, it takes enormous transactions before
>>> the time spent allocating and creating the dirty pages with the updated
>>> b-tree comes anywhere close to the time spent waiting for disk
>>> flushing, even with an SSD. But the more pertinent question is
>>> efficiency, measured in CPU cycles rather than elapsed time. When I ran
>>> my tests this morning of 100 (sync) transactions with 100 puts per
>>> transaction, times varied quite a bit, but running with writemap
>>> enabled typically averaged about 500ms of CPU, and with writemap
>>> disabled it typically averaged around 600ms. Not a huge difference, but
>>> still definitely worthwhile, I think.
>>>
>>> Caveat emptor: measuring LMDB performance with sync transactions on
>>> Windows is one of the most frustratingly erratic things to measure. It
>>> is sunny outside right now; times could be different when it starts
>>> raining later. But this is what I saw this morning...
>>>
>>> > What is the performance difference between your patch using writemap,
>>> > and just not using writemap in the first place?
>>>
>>> Running 1000 sync transactions on a 3GB db with a single put per
>>> transaction, without writemap mode and without the patch, took about 60
>>> seconds. With the patch and writemap mode enabled, it took about 1
>>> second! (There is no significant difference in sync times with writemap
>>> enabled or disabled once the patch is applied.) So the difference was
>>> huge in my test. And not only that: without the patch, the CPU usage
>>> was actually _higher_ during those 60 seconds (close to 100% of a core)
>>> than during the one-second run with the patch (close to 50%). There are
>>> certainly tests I have run where the differences are not as large
>>> (doing small commits on large dbs accentuates the differences), but the
>>> patch always seems to win. It could also be that my particular
>>> configuration causes bigger differences (an SSD drive, and maybe a more
>>> fragmented file?).
>>>
>>> Anyway, I added error handling for the malloc and fixed/changed the
>>> other things you suggested. I would be happy to make any other changes
>>> you want. The updated patch is here:
>>> https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
>>>
>>> > OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
>>> > Probably this ought to just be pre-allocated based on the maximum
>>> > number of dirty pages a txn allows.
>>>
>>> I wasn't sure I understood this comment. Are you suggesting we
>>> malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and
>>> retain it for the life of the environment? I think that is 4MB, if my
>>> math is right, which seems like a lot of memory to keep allocated (we
>>> usually have a lot of open environments). If the goal is to reduce the
>>> number of mallocs, how about we retain the OVERLAPPED array, and only
>>> free and re-malloc if the previous allocation wasn't large enough? Then
>>> there is no unnecessary allocation, and we only malloc when a
>>> transaction is bigger than any previous one. I put this together in a
>>> separate commit, as I wasn't sure if this is what you wanted (I can
>>> squash if you prefer):
>>> https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
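
For illustration, a minimal sketch of that grow-only retention scheme (the
ov_cache struct and function name are hypothetical stand-ins for state the
patch keeps on the environment; this is not the actual patch code):

    #include <stdlib.h>
    #include <windows.h>

    /* Hypothetical per-environment state; the real patch would hang
     * this off the LMDB environment. */
    typedef struct ov_cache {
        OVERLAPPED *ov_array;   /* retained across transactions */
        size_t      ov_count;   /* capacity, in OVERLAPPED entries */
    } ov_cache;

    /* Return an array large enough for npages writes, reallocating
     * only when a transaction is bigger than any previous one. */
    static OVERLAPPED *get_overlapped_array(ov_cache *c, size_t npages)
    {
        if (npages > c->ov_count) {
            OVERLAPPED *ov = malloc(npages * sizeof(OVERLAPPED));
            if (!ov)
                return NULL;        /* caller reports out-of-memory */
            free(c->ov_array);      /* old contents need not survive */
            c->ov_array = ov;
            c->ov_count = npages;
        }
        return c->ov_array;         /* freed once, when the env closes */
    }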
>>> Thank you for the review!
>>>
>>> Thanks,
>>> Kris
>>>
>>> From: Howard Chu <[email protected]>
>>> Sent: April 30, 2019 7:12 AM
>>> To: [email protected]; [email protected]
>>> Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
>>>
>>> [email protected] wrote:
>>> > Full_Name: Kristopher William Zyp
>>> > Version: LMDB 0.9.23
>>> > OS: Windows
>>> > URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
>>> > Submission from: (NULL) (71.199.6.148)
>>> >
>>> > We have seen very poor performance on the sync of commits on large
>>> > databases in Windows. On databases with 2GB of data, in writemap
>>> > mode, the sync of even small commits is consistently well over 100ms
>>> > (without writemap it is faster, but still slow). It is expected that
>>> > a sync should take some time while waiting for disk confirmation of
>>> > the writes, but more concerning is that these sync operations (in
>>> > writemap mode) are instead dominated by nearly 100% system CPU
>>> > utilization, so operations that require sub-millisecond b-tree
>>> > updates end up dominated by very large amounts of system CPU cycles
>>> > during the sync phase.
>>> >
>>> > I think the fundamental problem is that FlushViewOfFile seems to be
>>> > an O(n) operation, where n is the size of the file (or map). I
>>> > presume that Windows is scanning the entire map/file for dirty pages
>>> > to flush, I'm guessing because it doesn't have an internal index of
>>> > all the dirty pages for every file/map-view in the OS disk cache.
>>> > Therefore, this turns into an extremely expensive, CPU-bound
>>> > operation just to find the dirty pages of a large file and initiate
>>> > their writes, which, of course, is contrary to the whole goal of a
>>> > scalable database system. FlushFileBuffers is also relatively slow.
>>> > We have attempted to batch as many operations into a single
>>> > transaction as possible, but this is still a very large overhead.
>>> >
>>> > The Windows docs for FlushFileBuffers warn about the inefficiency of
>>> > this function
>>> > (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers),
>>> > which also points to the solution: it is much faster to write out
>>> > the dirty pages with WriteFile through a sync file handle
>>> > (FILE_FLAG_WRITE_THROUGH).
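
For reference, a minimal sketch of opening such a handle (the function
name, sharing flags, and wide-character path are illustrative assumptions,
not necessarily the patch's exact choices):

    #include <windows.h>

    /* Open a second, write-only handle to the data file alongside the
     * memory map. FILE_FLAG_WRITE_THROUGH makes each completed WriteFile
     * durable on media, and FILE_FLAG_OVERLAPPED lets many page writes
     * be issued before waiting on any of them. */
    static HANDLE open_writethrough_handle(const wchar_t *path)
    {
        return CreateFileW(path,
                           GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL,            /* default security */
                           OPEN_EXISTING,   /* data file already exists */
                           FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
                           NULL);           /* INVALID_HANDLE_VALUE on error */
    }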
>>> >
>>> > The associated patch
>>> > (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
>>> > is my attempt at implementing this solution for Windows. Fortunately,
>>> > with the design of LMDB, this is relatively straightforward: LMDB
>>> > already supports writing out dirty pages with WriteFile calls. I
>>> > added a write-through handle for sending these writes directly to
>>> > disk. I then made that file handle overlapped/asynchronous, so that
>>> > all the writes for a commit can be started in overlapped mode and (at
>>> > least theoretically) transfer to the drive in parallel, and then used
>>> > GetOverlappedResult to wait for their completion. So basically
>>> > mdb_page_flush becomes the sync. I extended the writing of dirty
>>> > pages through WriteFile to writemap mode as well (for writing meta
>>> > too), so that WriteFile with write-through can be used to flush the
>>> > data without ever needing to call FlushViewOfFile or
>>> > FlushFileBuffers. I also implemented support for write gathering in
>>> > writemap mode, where contiguous file positions imply contiguous
>>> > memory (by tracking the starting position with wdp and writing
>>> > contiguous pages in single operations). Sorting of the dirty list is
>>> > maintained even in writemap mode for this purpose.
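
A minimal sketch of that flush-as-sync loop, under simplifying assumptions
(a sorted array of dirty page numbers backed by a single contiguous
writemap; all names are hypothetical, and the real patch works on LMDB's
dirty list and retains its event handles rather than recreating them):

    #include <windows.h>
    #include <string.h>

    /* Coalesce runs of adjacent dirty pages into single overlapped
     * WriteFile calls on the write-through handle, then wait for each
     * with GetOverlappedResult -- the waiting is the commit's sync, so
     * FlushViewOfFile/FlushFileBuffers are never needed. Assumes each
     * coalesced run fits in a DWORD byte count. */
    static int flush_dirty_pages(HANDLE fd, char *map, size_t psize,
                                 const size_t *pgnos, size_t npages,
                                 OVERLAPPED *ov)    /* retained array */
    {
        size_t i = 0, n = 0;
        int rc = 0;

        while (i < npages) {
            size_t first = pgnos[i], run = 1;
            while (i + run < npages && pgnos[i + run] == first + run)
                run++;                      /* gather contiguous pages */

            ULONGLONG off = (ULONGLONG)first * psize;
            memset(&ov[n], 0, sizeof(OVERLAPPED));
            ov[n].Offset     = (DWORD)off;
            ov[n].OffsetHigh = (DWORD)(off >> 32);
            ov[n].hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);

            if (!WriteFile(fd, map + off, (DWORD)(run * psize), NULL, &ov[n])
                && GetLastError() != ERROR_IO_PENDING) {
                CloseHandle(ov[n].hEvent);
                rc = -1;                    /* caller maps GetLastError() */
                break;
            }
            n++;
            i += run;
        }

        for (size_t j = 0; j < n; j++) {    /* this wait is the sync */
            DWORD written;
            if (!GetOverlappedResult(fd, &ov[j], &written, TRUE))
                rc = -1;
            CloseHandle(ov[j].hEvent);
        }
        return rc;
    }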
>>>
>>> What is the point of using writemap mode if you still need to use
>>> WriteFile on every individual page?
>>>
>>> > The performance benefits of this patch, in my testing, are
>>> > considerable. Writing out/syncing transactions is typically over 5x
>>> > faster in writemap mode, and 2x faster in standard mode. And perhaps
>>> > more importantly (especially in environments with many
>>> > threads/processes), the efficiency benefits are even larger,
>>> > particularly in writemap mode, where there can be a 50-100x reduction
>>> > in system CPU usage with this patch. This brings Windows performance
>>> > with sync'ed transactions in LMDB back into the range of "lightning"
>>> > performance :).
>>>
>>> What is the performance difference between your patch using writemap,
>>> and just not using writemap in the first place?
>>>
>>> --
>>>   -- Howard Chu
>>>   CTO, Symas Corp.           http://www.symas.com
>>>   Director, Highland Sun     http://highlandsun.com/hyc/
>>>   Chief Architect, OpenLDAP  http://www.openldap.org/project/
