Bron Gondwana wrote:
> Hi All,
>
> We were debugging the CPU usage in a ctl_conversationsdb rebuild yesterday,
> and noticed an interesting thing. 70% of the CPU utilisation for this one
> process was inside the kernel! Mostly with dirty pages.
>
> ctl_conversationsdb -R is particularly heavy on the twoskip database - it's
> rewriting a lot of random keys. This leads to writes all over the place, as
> it stitches records into the skiplists.
>
> Of course the "real answer"[tm] is zeroskip, which doesn't do random writes -
> but until then, we suspect that the cost is largely due to the fact that we
> use mmap to read, and fwrite to write! We know that might be less efficient
> already from Linus' comments about 10 years ago! And I guess here's the
> proof.
>
> An option would be to switch to using mmap to write as well. We could easily
> modify lib/mappedfile to memcpy to do the writes.
>
> Does anybody see any strong reason not to?

I've covered the reasons for/against writing thru mmap in my LMDB design
papers. I don't know how relevant all of these are for your use case:

1: writing thru mmap loses any control over write ordering - the OS will
   page dirty pages out in arbitrary order. If you're using a filesystem
   that supports ordered writes, it will preserve the ordering of data from
   write() calls.

2: making the mmap writable opens the possibility of undetectable data
   structure corruption if any other code is doing stray writes through
   arbitrary pointers. You need to be very sure your code is bug-free.

3: if your DB is larger than RAM, writing thru mmap is slower than using
   write() syscalls. Whenever you access a page for the first time, the OS
   will page it in. This is a wasted I/O if all you're doing is overwriting
   the page with new data.

4: you can't use mmap exclusively if you need to grow the output file. You
   can only write thru the mapping to pages that already exist. If you need
   to grow the file, you must preallocate the space, otherwise you get a
   fatal signal (SIGBUS on POSIX systems) when referencing unallocated pages.

And a side note: multiple studies have shown that skiplists are not
cache-friendly, and thus have inferior performance to B+tree organizations.
A skiplist is a very poor choice for a read/write data structure. Obviously
I would recommend you use something carefully designed and heavily tested,
like LMDB, instead of whatever you're using.

There's one point in favor of writing thru mmap - if you take care of all
the other potential gotchas, it will work on every OS that implements mmap.
Using mmap for reads, and syscalls for writes, is only valid on OSs with a
unified buffer cache. While this isn't a problem on most modern OSs, OpenBSD
is a notable example of an OS that lacks this, and so that approach always
results in file corruption there.

-- 
Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/
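
To illustrate point 4 above, here is a minimal sketch of a write through a shared mapping that grows the file first. It is written in Python rather than C purely for brevity. The helper name write_through_mmap is hypothetical and is not part of lib/mappedfile. It assumes a platform where os.ftruncate and mmap are available, and it makes no attempt to deal with the ordering issue from point 1.

```python
import mmap
import os
import tempfile


def write_through_mmap(path, offset, data):
    """Store `data` at `offset` via a writable shared mapping.

    Hypothetical helper for illustration only. The file is extended
    with ftruncate() before mapping, because a mapping can only be
    written where the underlying file already has pages.
    """
    if not data:
        return  # nothing to write; also avoids mapping a zero-length range
    needed = offset + len(data)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Preallocate: grow the file so every page we touch exists.
        if os.fstat(fd).st_size < needed:
            os.ftruncate(fd, needed)
        with mmap.mmap(fd, needed, access=mmap.ACCESS_WRITE) as m:
            # The memcpy-style store into the mapping.
            m[offset:offset + len(data)] = data
            # Request write-back of the dirty pages. This does not give
            # any control over the order in which pages reach the disk.
            m.flush()
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.db")
        write_through_mmap(path, 4096, b"record")
        with open(path, "rb") as f:
            f.seek(4096)
            print(f.read())  # prints b'record'
```

Note that the store at offset 4096 into a new file touches a page that did not exist until the ftruncate() call, which is exactly the case that would otherwise fault.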