On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:

> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>
>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>> 1' command) that when running a writes-only test with relaximation
>>> (200 write processes), disk write activity is not continuous. Without
>>> this patch, there's continuous (every 1 second) write activity.
>>
>> I'm confused by this statement. You must be talking about relaximation
>> runs with delayed_commits = true, right? Why do you think you see larger
>> intervals between write activity with the optimization from COUCHDB-767?
>> Have you measured the time it takes to open the extra FD? In my tests
>> that was a sub-millisecond operation, but maybe you've uncovered
>> something else.
>
> No, it happens for tests with delayed_commits = false. The only
> possible explanation I see for the variance is the Erlang VM
> scheduler's decisions about when to start/run that process.
> Nevertheless, I don't know the exact cause, but the fsync frequency
> varies a lot.
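> To illustrate, with the patch the sync ends up delegated to a
> separately scheduled process holding its own fd, roughly like this (a
> simplified sketch, not the actual COUCHDB-767 code):
>
>     %% Simplified sketch: open an extra fd and fsync it in a spawned
>     %% process; exactly when that process gets to run is up to the
>     %% Erlang VM scheduler, which could explain the timing variance.
>     sync_with_extra_fd(Filepath) ->
>         spawn(fun() ->
>             {ok, Fd} = file:open(Filepath, [binary, read, raw]),
>             ok = file:sync(Fd),
>             ok = file:close(Fd)
>         end).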
I think it's worth investigating. I couldn't reproduce it on my plain old
spinning-disk MacBook with 200 writers in relaximation; the IOPS reported
by iostat stayed very uniform.

>>> For the goal of not having readers blocked by fsync calls (and write
>>> calls), I would propose using a separate couch_file process just for
>>> read operations. I have a branch on my GitHub for this (with
>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>> tests are very positive: both reads and writes get better response
>>> times and throughput:
>>>
>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>
>> I'd like to propose an alternative optimization, which is to keep a
>> dedicated file descriptor open in the couch_db_updater process and use
>> that file descriptor for _all_ IO initiated by the db_updater. The
>> advantage is that the db_updater does not need to do any message
>> passing for disk IO, and thus does not slow down when the incoming
>> message queue is large. A message queue much, much larger than the
>> number of concurrent writers can occur if a user writes with batch=ok,
>> and it can also happen rather easily in a BigCouch cluster.
>
> I don't see how that will improve things, since all write operations
> will still be done in a serialized manner. Since only couch_db_updater
> writes to the DB file, and since access to the couch_db_updater is
> serialized, it seems to me that your solution only avoids one level of
> indirection (the couch_file process). I don't see how, when using a
> couch_file only for writes, the message queue of that couch_file
> process gets full of write messages.

It's the db_updater which gets a large message queue, not the couch_file.
The db_updater ends up with a big backlog of update_docs messages that get
in the way when it needs to make gen_server calls to the couch_file
process for IO. It's a significant problem in R13B, probably less so in
R14B because of some cool optimizations by the OTP team.

> Also, what I did on that branch is a bit more generic, as it works for
> view index files as well, and doesn't introduce significant changes
> anywhere other than in couch_file.erl. Of course, your solution could
> easily be extended to the view updater process as well; I don't have
> anything against it.
>
> Anyway, +1.

I do like that the work you did applies immediately to the view group
files. Applying what I'm proposing to the view updater would probably be
easy, but not "zero lines changed" easy. On the other hand, the problem
I'm trying to avoid is a non-issue with views, since they're never updated
directly by clients.

Best,
Adam
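P.S. For concreteness, here's a rough sketch of what I mean by a dedicated
fd in the db_updater (hypothetical record fields and helper name, not a
patch):

    %% Sketch only: assume #state{} carries a raw fd owned by
    %% couch_db_updater plus the current end-of-file offset. Writing
    %% directly avoids a gen_server:call to couch_file, whose reply
    %% would otherwise have to be selectively received from a mailbox
    %% already backed up with update_docs messages.
    write_direct(Bin, #state{fd = Fd, eof = Eof} = State) ->
        ok = file:pwrite(Fd, Eof, Bin),
        State#state{eof = Eof + iolist_size(Bin)}.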