Sorry to keep pestering, but just pinging about this patch again, as I still
think this fix could benefit Windows users. And at this point, I think I can
say we have tested it pretty well, having run it on our servers for almost a
year :).
Thanks,
Kris

On Wed, Sep 18, 2019 at 12:56 PM Kris Zyp <[email protected]> wrote:
> Checking on this again: is this still a possibility for merging into LMDB?
> This fix is still working great (improved performance) on our systems.
> Thanks,
> Kris
>
> On Mon, Jun 17, 2019 at 1:04 PM Kris Zyp <[email protected]> wrote:
>
>> Is this still being considered/reviewed? Let me know if there are any
>> other changes you would like me to make. This patch has continued to
>> yield significant and reliable performance improvements for us, and it
>> would be nice for this to be available to other Windows users.
>>
>> On Fri, May 3, 2019 at 3:52 PM Kris Zyp <[email protected]> wrote:
>>
>>> For the sake of putting this in the email thread (other code discussion
>>> is in GitHub), here is the latest squashed commit of the proposed patch
>>> (with the on-demand, retained overlapped array to reduce re-mallocs and
>>> the opening of event handles):
>>> https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
>>>
>>> Thanks,
>>> Kris
>>>
>>> From: Kris Zyp <[email protected]>
>>> Sent: April 30, 2019 12:43 PM
>>> To: Howard Chu <[email protected]>; [email protected]
>>> Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
>>>
>>> > What is the point of using writemap mode if you still need to use
>>> > WriteFile on every individual page?
>>>
>>> As I understood from the documentation, and have observed, using
>>> writemap mode is faster (and uses less temporary memory) because it
>>> doesn't require mallocs to allocate pages (docs: "This is faster and
>>> uses fewer mallocs"). To be clear, though, LMDB is so incredibly fast
>>> and efficient that, in sync mode, it takes enormous transactions before
>>> the time spent allocating and creating the dirty pages with the updated
>>> b-tree comes anywhere close to the time spent waiting for disk
>>> flushing, even with an SSD. But the more pertinent question is
>>> efficiency, measured in CPU cycles rather than elapsed time. When I ran
>>> my tests this morning of 100 (sync) transactions with 100 puts per
>>> transaction, times varied quite a bit, but running with writemap
>>> enabled typically averaged about 500ms of CPU, and with writemap
>>> disabled it typically averaged around 600ms. Not a huge difference, but
>>> still definitely worthwhile, I think.
>>>
>>> Caveat emptor: measuring LMDB performance with sync transactions on
>>> Windows is one of the most frustratingly erratic things to measure. It
>>> is sunny outside right now; times could be different when it starts
>>> raining later. But this is what I saw this morning...
>>>
>>> > What is the performance difference between your patch using writemap,
>>> > and just not using writemap in the first place?
>>>
>>> Running 1000 sync transactions on a 3GB db with a single put per
>>> transaction, without writemap mode and without the patch, took about 60
>>> seconds. With the patch and writemap mode enabled, it took about 1
>>> second! (There is no significant difference in sync times with writemap
>>> enabled or disabled once the patch is applied.) So the difference was
>>> huge in my test. And not only that: without the patch, the CPU usage
>>> was actually _higher_ during those 60 seconds (close to 100% of a core)
>>> than during the one-second run with the patch (close to 50%). There are
>>> certainly tests I have run where the differences are not as large
>>> (doing small commits on large dbs accentuates the differences), but the
>>> patch always seems to win. It could also be that my particular
>>> configuration causes bigger differences (an SSD drive, and maybe a more
>>> fragmented file?).
>>>
>>> Anyway, I added error handling for the malloc and fixed/changed the
>>> other things you suggested. I would be happy to make any other changes
>>> you want. The updated patch is here:
>>> https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
>>>
>>> > OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
>>> > Probably this ought to just be pre-allocated based on the maximum
>>> > number of dirty pages a txn allows.
>>>
>>> I wasn't sure I understood this comment. Are you suggesting we
>>> malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and
>>> retain it for the life of the environment? I think that is 4MB, if my
>>> math is right, which seems like a lot of memory to keep allocated (we
>>> usually have a lot of open environments). If the goal is to reduce the
>>> number of mallocs, how about we retain the OVERLAPPED array, and only
>>> free and re-malloc if the previous allocation wasn't large enough? Then
>>> there is no unnecessary allocation, and we only malloc when a
>>> transaction is bigger than any previous one. I put this together in a
>>> separate commit, as I wasn't sure if this is what you wanted (I can
>>> squash if you prefer):
>>> https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
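
For illustration, a minimal sketch of that grow-only retention scheme (the
ov_cache struct and function name are hypothetical stand-ins for state the
patch keeps on the environment; this is not the actual patch code):

    #include <stdlib.h>
    #include <windows.h>

    /* Hypothetical per-environment state; the real patch would hang
     * this off the LMDB environment. */
    typedef struct ov_cache {
        OVERLAPPED *ov_array;   /* retained across transactions */
        size_t      ov_count;   /* capacity, in OVERLAPPED entries */
    } ov_cache;

    /* Return an array large enough for npages writes, reallocating
     * only when a transaction is bigger than any previous one. */
    static OVERLAPPED *get_overlapped_array(ov_cache *c, size_t npages)
    {
        if (npages > c->ov_count) {
            OVERLAPPED *ov = malloc(npages * sizeof(OVERLAPPED));
            if (!ov)
                return NULL;        /* caller reports out-of-memory */
            free(c->ov_array);      /* old contents need not survive */
            c->ov_array = ov;
            c->ov_count = npages;
        }
        return c->ov_array;         /* freed once, when the env closes */
    }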
>>> Thank you for the review!
>>>
>>> Thanks,
>>> Kris
>>>
>>> From: Howard Chu <[email protected]>
>>> Sent: April 30, 2019 7:12 AM
>>> To: [email protected]; [email protected]
>>> Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
>>>
>>> [email protected] wrote:
>>> > Full_Name: Kristopher William Zyp
>>> > Version: LMDB 0.9.23
>>> > OS: Windows
>>> > URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
>>> > Submission from: (NULL) (71.199.6.148)
>>> >
>>> > We have seen very poor performance on the sync of commits on large
>>> > databases in Windows. On databases with 2GB of data, in writemap
>>> > mode, the sync of even small commits is consistently well over 100ms
>>> > (without writemap it is faster, but still slow). It is expected that
>>> > a sync should take some time while waiting for disk confirmation of
>>> > the writes, but more concerning is that these sync operations (in
>>> > writemap mode) are instead dominated by nearly 100% system CPU
>>> > utilization, so operations that require sub-millisecond b-tree
>>> > updates end up dominated by very large amounts of system CPU cycles
>>> > during the sync phase.
>>> >
>>> > I think the fundamental problem is that FlushViewOfFile seems to be
>>> > an O(n) operation, where n is the size of the file (or map). I
>>> > presume that Windows is scanning the entire map/file for dirty pages
>>> > to flush, I'm guessing because it doesn't have an internal index of
>>> > all the dirty pages for every file/map-view in the OS disk cache.
>>> > Therefore, this turns into an extremely expensive, CPU-bound
>>> > operation just to find the dirty pages of a large file and initiate
>>> > their writes, which, of course, is contrary to the whole goal of a
>>> > scalable database system. FlushFileBuffers is also relatively slow.
>>> > We have attempted to batch as many operations into a single
>>> > transaction as possible, but this is still a very large overhead.
>>> >
>>> > The Windows docs for FlushFileBuffers warn about the inefficiency of
>>> > this function
>>> > (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers),
>>> > which also points to the solution: it is much faster to write out
>>> > the dirty pages with WriteFile through a sync file handle
>>> > (FILE_FLAG_WRITE_THROUGH).
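
For reference, a minimal sketch of opening such a handle (the function
name, sharing flags, and wide-character path are illustrative assumptions,
not necessarily the patch's exact choices):

    #include <windows.h>

    /* Open a second, write-only handle to the data file alongside the
     * memory map. FILE_FLAG_WRITE_THROUGH makes each completed WriteFile
     * durable on media, and FILE_FLAG_OVERLAPPED lets many page writes
     * be issued before waiting on any of them. */
    static HANDLE open_writethrough_handle(const wchar_t *path)
    {
        return CreateFileW(path,
                           GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL,            /* default security */
                           OPEN_EXISTING,   /* data file already exists */
                           FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
                           NULL);           /* INVALID_HANDLE_VALUE on error */
    }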
>>> >
>>> > The associated patch
>>> > (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
>>> > is my attempt at implementing this solution for Windows. Fortunately,
>>> > with the design of LMDB, this is relatively straightforward: LMDB
>>> > already supports writing out dirty pages with WriteFile calls. I
>>> > added a write-through handle for sending these writes directly to
>>> > disk. I then made that file handle overlapped/asynchronous, so that
>>> > all the writes for a commit can be started in overlapped mode and (at
>>> > least theoretically) transfer to the drive in parallel, and then used
>>> > GetOverlappedResult to wait for their completion. So basically
>>> > mdb_page_flush becomes the sync. I extended the writing of dirty
>>> > pages through WriteFile to writemap mode as well (for writing meta
>>> > too), so that WriteFile with write-through can be used to flush the
>>> > data without ever needing to call FlushViewOfFile or
>>> > FlushFileBuffers. I also implemented support for write gathering in
>>> > writemap mode, where contiguous file positions imply contiguous
>>> > memory (by tracking the starting position with wdp and writing
>>> > contiguous pages in single operations). Sorting of the dirty list is
>>> > maintained even in writemap mode for this purpose.
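
A minimal sketch of that flush-as-sync loop, under simplifying assumptions
(a sorted array of dirty page numbers backed by a single contiguous
writemap; all names are hypothetical, and the real patch works on LMDB's
dirty list and retains its event handles rather than recreating them):

    #include <windows.h>
    #include <string.h>

    /* Coalesce runs of adjacent dirty pages into single overlapped
     * WriteFile calls on the write-through handle, then wait for each
     * with GetOverlappedResult -- the waiting is the commit's sync, so
     * FlushViewOfFile/FlushFileBuffers are never needed. Assumes each
     * coalesced run fits in a DWORD byte count. */
    static int flush_dirty_pages(HANDLE fd, char *map, size_t psize,
                                 const size_t *pgnos, size_t npages,
                                 OVERLAPPED *ov)    /* retained array */
    {
        size_t i = 0, n = 0;
        int rc = 0;

        while (i < npages) {
            size_t first = pgnos[i], run = 1;
            while (i + run < npages && pgnos[i + run] == first + run)
                run++;                      /* gather contiguous pages */

            ULONGLONG off = (ULONGLONG)first * psize;
            memset(&ov[n], 0, sizeof(OVERLAPPED));
            ov[n].Offset     = (DWORD)off;
            ov[n].OffsetHigh = (DWORD)(off >> 32);
            ov[n].hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);

            if (!WriteFile(fd, map + off, (DWORD)(run * psize), NULL, &ov[n])
                && GetLastError() != ERROR_IO_PENDING) {
                CloseHandle(ov[n].hEvent);
                rc = -1;                    /* caller maps GetLastError() */
                break;
            }
            n++;
            i += run;
        }

        for (size_t j = 0; j < n; j++) {    /* this wait is the sync */
            DWORD written;
            if (!GetOverlappedResult(fd, &ov[j], &written, TRUE))
                rc = -1;
            CloseHandle(ov[j].hEvent);
        }
        return rc;
    }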
>>>
>>> What is the point of using writemap mode if you still need to use
>>> WriteFile on every individual page?
>>>
>>> > The performance benefits of this patch, in my testing, are
>>> > considerable. Writing out/syncing transactions is typically over 5x
>>> > faster in writemap mode, and 2x faster in standard mode. And perhaps
>>> > more importantly (especially in environments with many
>>> > threads/processes), the efficiency benefits are even larger,
>>> > particularly in writemap mode, where there can be a 50-100x reduction
>>> > in system CPU usage with this patch. This brings Windows performance
>>> > with sync'ed transactions in LMDB back into the range of "lightning"
>>> > performance :).
>>>
>>> What is the performance difference between your patch using writemap,
>>> and just not using writemap in the first place?
>>>
>>> --
>>>   -- Howard Chu
>>>   CTO, Symas Corp.           http://www.symas.com
>>>   Director, Highland Sun     http://highlandsun.com/hyc/
>>>   Chief Architect, OpenLDAP  http://www.openldap.org/project/
