On Nov 8, 2010, at 3:28 PM, Randall Leeds wrote:

> On Mon, Nov 8, 2010 at 12:22, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <randall.le...@gmail.com>
>> wrote:
>>> Whoops. Hit send too early, but I think I got everything in there that
>>> I wanted to say.
>>>
>>> As for the ref counter bottleneck, I just pushed to
>>> https://github.com/tilgovi/couchdb/tree/ets_ref_count
>>> This branch uses a public ets table for the ref_counter. I think I managed
>>> to linearize the updates over the {total, RefCtr} keys in the ets table
>>> such that there should be no race conditions, but please, please take a
>>> look at this if you have time.
>>>
>>> It seems to pass the ref_counter tests, but I still need to handle
>>> giving away ownership of the ets table. Right now I use couch_server
>>> as the heir so I can use only one ETS table for all couch_ref_counter
>>> processes, but the couch_server just crashes if it actually receives
>>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>>> table to another couch_ref_counter whenever the owner exits I may just
>>> break the encapsulation of the module a bit by leaving couch_server as
>>> the owner and ignoring that message.
>>>
>>> Thanks, guys. My gut says we're going to get some nice numbers when
>>> all this is done.
>>>
>>> -Randall
>>>
>>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <randall.le...@gmail.com> wrote:
>>>> Thanks to both of you for getting this conversation going again and
>>>> for the work on the patch and testing, Filipe.
>>>>
>>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>>
>>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <kocol...@apache.org>
>>>>>> wrote:
>>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>>
>>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocol...@apache.org>
>>>>>>>> wrote:
>>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>>
>>>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>>>>>>> 1' command) that when running a writes-only test with relaximation
>>>>>>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>>>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>>>>>>
>>>>>>>>> I'm confused by this statement. You must be talking about
>>>>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>>>>> think you see larger intervals between write activity with the
>>>>>>>>> optimization from COUCHDB-767? Have you measured the time it takes
>>>>>>>>> to open the extra FD? In my tests that was a sub-millisecond
>>>>>>>>> operation, but maybe you've uncovered something else.
>>>>>>>>
>>>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>>>> possible explanation I see for the variance might be related to the
>>>>>>>> Erlang VM scheduler decisions about when to start/run that process.
>>>>>>>> Nevertheless, I don't know the exact cause, but the fsync run frequency
>>>>>>>> varies a lot.
>>>>>>>
>>>>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>>>>> plain-old spinning disk MacBook with 200 writers in relaximation; the
>>>>>>> IOPS reported by iostat stayed very uniform.
>>>>>>>
>>>>>>>>>> For the goal of not having readers getting blocked by fsync calls
>>>>>>>>>> (and write calls), I would propose using a separate couch_file process
>>>>>>>>>> just for read operations. I have a branch in my github for this (with
>>>>>>>>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>>>>>>>>> tests are very positive; both reads and writes get better response
>>>>>>>>>> times and throughput:
>>>>>>>>>>
>>>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>>>
>>>>>>>>> I'd like to propose an alternative optimization, which is to keep a
>>>>>>>>> dedicated file descriptor open in the couch_db_updater process and
>>>>>>>>> use that file descriptor for _all_ IO initiated by the db_updater.
>>>>>>>>> The advantage is that the db_updater does not need to do any message
>>>>>>>>> passing for disk IO, and thus does not slow down when the incoming
>>>>>>>>> message queue is large. A message queue much, much larger than the
>>>>>>>>> number of concurrent writers can occur if a user writes with
>>>>>>>>> batch=ok, and it can also happen rather easily in a BigCouch cluster.
>>>>>>>>
>>>>>>>> I don't see how that will improve things, since all write operations
>>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>>>>> serialized, to me it only seems that your solution avoids one level
>>>>>>>> of indirection (the couch_file process). I don't see how, when using a
>>>>>>>> couch_file only for writes, you get the message queue for that
>>>>>>>> couch_file process full of write messages.
>>>>>>>
>>>>>>> It's the db_updater which gets a large message queue, not the
>>>>>>> couch_file. The db_updater ends up with a big backlog of update_docs
>>>>>>> messages that get in the way when it needs to make gen_server calls to
>>>>>>> the couch_file process for IO. It's a significant problem in R13B,
>>>>>>> probably less so in R14B because of some cool optimizations by the OTP
>>>>>>> team.
>>>>>>
>>>>>> So, let me see if I get it. The couch_db_updater process is slow
>>>>>> picking up the results of the calls to the couch_file process because
>>>>>> its mailbox is full of update_docs messages?
>>>>>
>>>>> Correct. Each call to the couch_file requires a selective receive on the
>>>>> part of the db_updater in order to get the response, and prior to R14
>>>>> that selective receive needed to match against every message in the
>>>>> mailbox. It's really a bigger problem in couch_server, which uses a
>>>>> gen_server call to increment a reference counter before handing the #db{}
>>>>> to the client, since every request to any DB has to talk to couch_server
>>>>> first.
>>>>>
>>>>> Best,
>>>>> Adam
>>>>
>>>> Adam,
>>>> I think the problem is made worse by a backed-up db_updater, but the
>>>> db_updater becomes backed up because it makes more synchronous calls
>>>> to the couch_file than a reader does, handling only one update
>>>> operation at a time while readers queue up on the couch_file in
>>>> parallel.
>>>>
>>>> Filipe,
>>>> Using a separate fd for writes at the couch_file level is not the
>>>> answer. The db_updater has to read the btree before it can write,
>>>> incurring multiple trips through the couch_file message queue between
>>>> queuing append_term requests and processing its message queue for new
>>>> updates.
>>>> Using two file descriptors keeps the readers out of the way
>>>> of the writers only if you select which fd to use at the db-operation
>>>> level and not the file-operation level. Perhaps two couch_file
>>>> processes is better. Fairness should be left to the operating system
>>>> I/O scheduler once reads don't queue behind writes. This seems like
>>>> the best way forward to me right now. Let's try to crunch some numbers
>>>> on it soon.
>>>>
>>>> I couldn't find a solution I liked that was fair to readers and
>>>> writers at any workload with only one file descriptor. The btree cache
>>>> alleviates this problem a bit because the read path becomes much
>>>> faster and therefore improves database reads and writes.
>>>>
>>>> As to the patch, I'd think we need the readers and writers separated
>>>> into two separate couch_files. That way the updater can perform its
>>>> reads on the "writer" fd; otherwise writers suffer starvation because
>>>> readers go directly into the couch_file queue in parallel instead of
>>>> serializing through something like db_updater.
>>>>
>>>
>>
>> Wasn't there a branch or patch somewhere that just removed the
>> ref_counter code entirely and used monitors/links to make sure
>> everything behaved correctly? I'm not sure I ever saw it to see how
>> dramatic and/or scary it was, but it might be another approach to
>> consider.
>>
>
> Adam should chime in. I think BigCouch got rid of the ref counter in
> favor of something else, but last I asked him about it he said there
> might be a small edge-case race condition. How critical that is I
> can't say. It may be that that edge case is tolerable and recoverable.
BigCouch uses a protected couch_dbs ets table to store all open #dbs and a
public couch_lru ets table for LRU updates. Reference counting is
accomplished by having clients monitor and demonitor the couch_file. The
least recently used database is determined by folding over the couch_lru
table. The race condition Randall is referring to is the following:

1) client process grabs a #db{} record from the couch_dbs ets table
2) couch_server calculates that the LRU DB is the one the client just
   looked up
3) couch_server checks the monitored_by field for the #db.fd process and
   finds no clients are monitoring it
4) couch_server kills the DB
5) client updates the LRU time (too late)
6) client monitors the #db.fd, but it's already dead

Practically speaking, it's been a non-issue, but it does exist. For what
it's worth, a similar race condition existed in older (0.10.x and below)
versions of CouchDB, where the client would increment the ref counter after
receiving the DB from couch_server. The current implementation in CouchDB
avoids the race condition by having couch_server increment the ref counter
on behalf of the client, but that approach has serious performance and
stability implications for BigCouch.

I'll try to take a look at Randall's github branch tonight. Using monitors
for reference counting feels really right to me, but as it's not possible
to monitor a process on behalf of someone else I'm not sure it's possible
to avoid the race that I described with monitors.

Regards,
Adam
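
To make the monitor scheme Adam describes concrete, here is a minimal
sketch of counting references via the VM's own monitor bookkeeping. The
module and function names are illustrative, not BigCouch's actual code; it
also shows where the window opens, since a client that has fetched the
#db{} but not yet called add_ref/1 is invisible to ref_count/1, which is
exactly steps 1-6 above.

%% Minimal sketch of monitor-based ref counting (illustrative names,
%% not BigCouch's actual implementation).
-module(monitor_refctr_sketch).
-export([add_ref/1, drop_ref/1, ref_count/1]).

%% A client takes a reference by monitoring the db's fd process...
add_ref(Fd) when is_pid(Fd) ->
    erlang:monitor(process, Fd).

%% ...and drops it by demonitoring; flush discards a pending 'DOWN'.
drop_ref(MonRef) ->
    erlang:demonitor(MonRef, [flush]).

%% couch_server can ask the VM how many clients still hold references.
%% Note the race: a client that looked up the #db{} but has not yet
%% called add_ref/1 is not counted here.
ref_count(Fd) when is_pid(Fd) ->
    case erlang:process_info(Fd, monitored_by) of
        {monitored_by, Pids} -> length(Pids);
        undefined            -> 0    % fd process already exited
    end.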
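
For reference, the public-ets approach Randall's branch takes can be
sketched as follows, assuming illustrative module and table names rather
than the actual ets_ref_count branch code. The point is that
ets:update_counter/3 is atomic, so clients can bump counters concurrently
without a gen_server call in the path.

%% Minimal sketch of ref counting in a public ets table (illustrative
%% names; not the code in the ets_ref_count branch).
-module(ets_refctr_sketch).
-export([new/0, incr/2, decr/2, count/2]).

new() ->
    %% A public table lets any client process update counters directly.
    ets:new(couch_ref_counters, [set, public]).

incr(Tab, Key) ->
    ets:insert_new(Tab, {Key, 0}),        % no-op if the counter exists
    ets:update_counter(Tab, Key, 1).      % atomic increment

decr(Tab, Key) ->
    ets:update_counter(Tab, Key, -1).

count(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, N}] -> N;
        []         -> 0
    end.

This sketch deliberately leaves out the ownership question Randall raises:
a public table still has exactly one owner process, and something has to
inherit it (heir/'ETS-TRANSFER') when that owner dies.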
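
Adam's selective-receive point can be modeled with a toy example; none of
the following is CouchDB code. A process with a backlogged mailbox stands
in for the db_updater and makes a call to a stub couch_file:

%% Toy model of the selective-receive cost; not CouchDB code.
-module(selrecv_sketch).
-export([demo/1]).

demo(Backlog) ->
    Fd = spawn(fun file_stub/0),          % stands in for couch_file
    %% Simulate a db_updater mailbox full of update_docs messages.
    [self() ! {update_docs, N} || N <- lists:seq(1, Backlog)],
    Ref = make_ref(),
    Fd ! {pread, self(), Ref},
    receive
        %% Before R14, matching this reply means walking past all
        %% Backlog non-matching messages in the mailbox first.
        {Ref, Reply} -> Reply
    end.

file_stub() ->
    receive
        {pread, From, Ref} -> From ! {Ref, ok}
    end.

This is the same shape as a gen_server:call, which is why the OTP
optimization Adam mentions helps: from R14 on, the receive can skip every
message that was already queued when the ref was created.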
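
Finally, a rough sketch of what Randall means by choosing the fd at the
db-operation level rather than the file-operation level. The
reader_fd/updater_fd fields are hypothetical, not the current #db{}
record; couch_file:pread_term/2 and couch_file:append_term/2 are the
underlying couch_file calls.

%% Sketch of routing I/O per db operation (hypothetical record fields).
-module(fd_routing_sketch).
-export([read_doc_body/2, updater_read/2, append_term/2]).

-record(db, {reader_fd, updater_fd}).

%% All client reads go to the reader couch_file...
read_doc_body(#db{reader_fd = Fd}, Ptr) ->
    couch_file:pread_term(Fd, Ptr).

%% ...while the db_updater does both its btree reads and its appends on
%% the writer couch_file, so updates never wait behind client reads.
updater_read(#db{updater_fd = Fd}, Ptr) ->
    couch_file:pread_term(Fd, Ptr).

append_term(#db{updater_fd = Fd}, Term) ->
    couch_file:append_term(Fd, Term).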