On Nov 8, 2010, at 3:28 PM, Randall Leeds wrote:

> On Mon, Nov 8, 2010 at 12:22, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>> On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <randall.le...@gmail.com>
>> wrote:
>>> Whoops. Hit send too early, but I think I got everything in there that
>>> I wanted to say.
>>>
>>> As for the ref counter bottleneck, I just pushed to
>>> https://github.com/tilgovi/couchdb/tree/ets_ref_count
>>> This branch uses a public ets table for the ref_counter. I think I managed
>>> to linearize the updates over the {total, RefCtr} keys in the ets table
>>> such that there should be no race conditions, but please, please take a
>>> look at this if you have time.
>>>
>>> It seems to pass the ref_counter tests, but I still need to handle
>>> giving away ownership of the ets table. Right now I use couch_server
>>> as the heir so I can use only one ETS table for all couch_ref_counter
>>> processes, but the couch_server just crashes if it actually receives
>>> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
>>> table to another couch_ref_counter whenever the owner exits I may just
>>> break the encapsulation of the module a bit by leaving couch_server as
>>> the owner and ignoring that message.
>>>
>>> Thanks, guys. My gut says we're going to get some nice numbers when
>>> all this is done.
>>>
>>> -Randall
>>>
>>> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <randall.le...@gmail.com> wrote:
>>>> Thanks to both of you for getting this conversation going again and
>>>> for the work on the patch and testing, Filipe.
>>>>
>>>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>>>
>>>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <kocol...@apache.org>
>>>>>> wrote:
>>>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>>>
>>>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocol...@apache.org>
>>>>>>>> wrote:
>>>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>>>
>>>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>>>>>>>>> 1' command) that when running a writes-only test with relaximation
>>>>>>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>>>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>>>>>>
>>>>>>>>> I'm confused by this statement. You must be talking about
>>>>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>>>>> think you see larger intervals between write activity with the
>>>>>>>>> optimization from COUCHDB-767? Have you measured the time it takes
>>>>>>>>> to open the extra FD? In my tests that was a sub-millisecond
>>>>>>>>> operation, but maybe you've uncovered something else.
>>>>>>>>
>>>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>>>> possible explanation I see for the variance might be related to the
>>>>>>>> Erlang VM scheduler decisions about when to start/run that process.
>>>>>>>> Nevertheless, I don't know the exact cause, but the fsync run frequency
>>>>>>>> varies a lot.
>>>>>>>
>>>>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>>>>> plain-old spinning disk MacBook with 200 writers in relaximation; the
>>>>>>> IOPS reported by iostat stayed very uniform.
>>>>>>>
>>>>>>>>>> For the goal of not having readers getting blocked by fsync calls
>>>>>>>>>> (and write calls), I would propose using a separate couch_file process
>>>>>>>>>> just for read operations. I have a branch in my github for this (with
>>>>>>>>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>>>>>>>>> tests are very positive; both reads and writes get better response
>>>>>>>>>> times and throughput:
>>>>>>>>>>
>>>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>>>
>>>>>>>>> I'd like to propose an alternative optimization, which is to keep a
>>>>>>>>> dedicated file descriptor open in the couch_db_updater process and
>>>>>>>>> use that file descriptor for _all_ IO initiated by the db_updater.
>>>>>>>>> The advantage is that the db_updater does not need to do any message
>>>>>>>>> passing for disk IO, and thus does not slow down when the incoming
>>>>>>>>> message queue is large. A message queue much, much larger than the
>>>>>>>>> number of concurrent writers can occur if a user writes with
>>>>>>>>> batch=ok, and it can also happen rather easily in a BigCouch cluster.
>>>>>>>>
>>>>>>>> I don't see how that will improve things, since all write operations
>>>>>>>> will still be done in a serialized manner. Since only couch_db_updater
>>>>>>>> writes to the DB file, and since access to the couch_db_updater is
>>>>>>>> serialized, to me it only seems that your solution avoids one level
>>>>>>>> of indirection (the couch_file process). I don't see how, when using a
>>>>>>>> couch_file only for writes, you get the message queue for that
>>>>>>>> couch_file process full of write messages.
>>>>>>>
>>>>>>> It's the db_updater which gets a large message queue, not the
>>>>>>> couch_file. The db_updater ends up with a big backlog of update_docs
>>>>>>> messages that get in the way when it needs to make gen_server calls to
>>>>>>> the couch_file process for IO. It's a significant problem in R13B,
>>>>>>> probably less so in R14B because of some cool optimizations by the OTP
>>>>>>> team.
>>>>>>
>>>>>> So, let me see if I get it. The couch_db_updater process is slow
>>>>>> picking up the results of the calls to the couch_file process because
>>>>>> its mailbox is full of update_docs messages?
>>>>>
>>>>> Correct. Each call to the couch_file requires a selective receive on the
>>>>> part of the db_updater in order to get the response, and prior to R14
>>>>> that selective receive needed to match against every message in the
>>>>> mailbox. It's really a bigger problem in couch_server, which uses a
>>>>> gen_server call to increment a reference counter before handing the #db{}
>>>>> to the client, since every request to any DB has to talk to couch_server
>>>>> first.
>>>>>
>>>>> Best,
>>>>> Adam
>>>>
>>>> Adam,
>>>> I think the problem is made worse by a backed-up db_updater, but the
>>>> db_updater becomes backed up because it makes more synchronous calls
>>>> to the couch_file than a reader does, handling only one update
>>>> operation at a time while readers queue up on the couch_file in
>>>> parallel.
>>>>
>>>> Filipe,
>>>> Using a separate fd for writes at the couch_file level is not the
>>>> answer. The db_updater has to read the btree before it can write,
>>>> incurring multiple trips through the couch_file message queue between
>>>> queuing append_term requests and processing its message queue for new
>>>> updates.
>>>> Using two file descriptors keeps the readers out of the way
>>>> of the writers only if you select which fd to use at the db-operation
>>>> level and not the file-operation level. Perhaps two couch_file
>>>> processes is better. Fairness should be left to the operating system
>>>> I/O scheduler once reads don't queue behind writes. This seems like
>>>> the best way forward to me right now. Let's try to crunch some numbers
>>>> on it soon.
>>>>
>>>> I couldn't find a solution I liked that was fair to readers and
>>>> writers at any workload with only one file descriptor. The btree cache
>>>> alleviates this problem a bit because the read path becomes much
>>>> faster and therefore improves database reads and writes.
>>>>
>>>> As to the patch, I'd think we need the readers and writers separated
>>>> into two separate couch_files. That way the updater can perform its
>>>> reads on the "writer" fd; otherwise writers suffer starvation because
>>>> readers go directly into the couch_file queue in parallel instead of
>>>> serializing through something like db_updater.
>>>>
>>>
>>
>> Wasn't there a branch or patch somewhere that just removed the
>> ref_counter code entirely and used monitors/links to make sure
>> everything behaved correctly? I'm not sure I ever saw it to see how
>> dramatic and/or scary it was, but it might be another approach to
>> consider.
>>
>
> Adam should chime in. I think BigCouch got rid of the ref counter in
> favor of something else, but last I asked him about it he said there
> might be a small edge-case race condition. How critical that is I
> can't say. It may be that that edge case is tolerable and recoverable.
BigCouch uses a protected couch_dbs ets table to store all open #dbs and a
public couch_lru ets table for LRU updates. Reference counting is
accomplished by having clients monitor and demonitor the couch_file. The
least recently used database is determined by folding over the couch_lru
table. The race condition Randall is referring to is the following:

1) client process grabs a #db{} record from the couch_dbs ets table
2) couch_server calculates that the LRU DB is the one the client just
   looked up
3) couch_server checks the monitored_by field for the #db.fd process and
   finds no clients are monitoring it
4) couch_server kills the DB
5) client updates the LRU time (too late)
6) client monitors the #db.fd, but it's already dead

Practically speaking, it's been a non-issue, but it does exist. For what
it's worth, a similar race condition existed in older (0.10.x and below)
versions of CouchDB, where the client would increment the ref counter after
receiving the DB from couch_server. The current implementation in CouchDB
avoids the race condition by having couch_server increment the ref counter
on behalf of the client, but that approach has serious performance and
stability implications for BigCouch.

I'll try to take a look at Randall's github branch tonight. Using monitors
for reference counting feels really right to me, but as it's not possible
to monitor a process on behalf of someone else I'm not sure it's possible
to avoid the race that I described with monitors.

Regards,
Adam
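
To make the monitor scheme Adam describes concrete, here is a minimal
sketch of counting references via the VM's own monitor bookkeeping. The
module and function names are illustrative, not BigCouch's actual code; it
also shows where the window opens, since a client that has fetched the
#db{} but not yet called add_ref/1 is invisible to ref_count/1, which is
exactly steps 1-6 above.

%% Minimal sketch of monitor-based ref counting (illustrative names,
%% not BigCouch's actual implementation).
-module(monitor_refctr_sketch).
-export([add_ref/1, drop_ref/1, ref_count/1]).

%% A client takes a reference by monitoring the db's fd process...
add_ref(Fd) when is_pid(Fd) ->
    erlang:monitor(process, Fd).

%% ...and drops it by demonitoring; flush discards a pending 'DOWN'.
drop_ref(MonRef) ->
    erlang:demonitor(MonRef, [flush]).

%% couch_server can ask the VM how many clients still hold references.
%% Note the race: a client that looked up the #db{} but has not yet
%% called add_ref/1 is not counted here.
ref_count(Fd) when is_pid(Fd) ->
    case erlang:process_info(Fd, monitored_by) of
        {monitored_by, Pids} -> length(Pids);
        undefined            -> 0    % fd process already exited
    end.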
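
For reference, the public-ets approach Randall's branch takes can be
sketched as follows, assuming illustrative module and table names rather
than the actual ets_ref_count branch code. The point is that
ets:update_counter/3 is atomic, so clients can bump counters concurrently
without a gen_server call in the path.

%% Minimal sketch of ref counting in a public ets table (illustrative
%% names; not the code in the ets_ref_count branch).
-module(ets_refctr_sketch).
-export([new/0, incr/2, decr/2, count/2]).

new() ->
    %% A public table lets any client process update counters directly.
    ets:new(couch_ref_counters, [set, public]).

incr(Tab, Key) ->
    ets:insert_new(Tab, {Key, 0}),        % no-op if the counter exists
    ets:update_counter(Tab, Key, 1).      % atomic increment

decr(Tab, Key) ->
    ets:update_counter(Tab, Key, -1).

count(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, N}] -> N;
        []         -> 0
    end.

This sketch deliberately leaves out the ownership question Randall raises:
a public table still has exactly one owner process, and something has to
inherit it (heir/'ETS-TRANSFER') when that owner dies.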
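
Adam's selective-receive point can be modeled with a toy example; none of
the following is CouchDB code. A process with a backlogged mailbox stands
in for the db_updater and makes a call to a stub couch_file:

%% Toy model of the selective-receive cost; not CouchDB code.
-module(selrecv_sketch).
-export([demo/1]).

demo(Backlog) ->
    Fd = spawn(fun file_stub/0),          % stands in for couch_file
    %% Simulate a db_updater mailbox full of update_docs messages.
    [self() ! {update_docs, N} || N <- lists:seq(1, Backlog)],
    Ref = make_ref(),
    Fd ! {pread, self(), Ref},
    receive
        %% Before R14, matching this reply means walking past all
        %% Backlog non-matching messages in the mailbox first.
        {Ref, Reply} -> Reply
    end.

file_stub() ->
    receive
        {pread, From, Ref} -> From ! {Ref, ok}
    end.

This is the same shape as a gen_server:call, which is why the OTP
optimization Adam mentions helps: from R14 on, the receive can skip every
message that was already queued when the ref was created.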
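
Finally, a rough sketch of what Randall means by choosing the fd at the
db-operation level rather than the file-operation level. The
reader_fd/updater_fd fields are hypothetical, not the current #db{}
record; couch_file:pread_term/2 and couch_file:append_term/2 are the
underlying couch_file calls.

%% Sketch of routing I/O per db operation (hypothetical record fields).
-module(fd_routing_sketch).
-export([read_doc_body/2, updater_read/2, append_term/2]).

-record(db, {reader_fd, updater_fd}).

%% All client reads go to the reader couch_file...
read_doc_body(#db{reader_fd = Fd}, Ptr) ->
    couch_file:pread_term(Fd, Ptr).

%% ...while the db_updater does both its btree reads and its appends on
%% the writer couch_file, so updates never wait behind client reads.
updater_read(#db{updater_fd = Fd}, Ptr) ->
    couch_file:pread_term(Fd, Ptr).

append_term(#db{updater_fd = Fd}, Term) ->
    couch_file:append_term(Fd, Term).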