I've finally managed to narrow this down to a reproducible set of scripts and have created a GitHub issue: https://github.com/apache/couchdb/issues/1063

The summary is that I believe there is a resource leak in CouchDB when you abort continuous listening on the _global_changes database. Fortunately, there appears to be a workaround (if your use case supports it): use feed=longpoll instead of feed=continuous. I hope this saves someone else some headache!
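In case it helps anyone, here is a minimal sketch of what the longpoll approach looks like (illustrative only: the base URL is a placeholder, auth is omitted, and it assumes a fetch implementation such as node-fetch):

    // Instead of holding one continuous connection open (and later
    // aborting it, which is what appears to trigger the leak), poll in
    // a loop: each longpoll request returns as soon as there is at
    // least one change, and the next request resumes from last_seq,
    // so no request is ever aborted mid-stream.
    const BASE = 'http://localhost:5984'; // placeholder

    async function watchGlobalChanges() {
      let since = 'now';
      for (;;) {
        const res = await fetch(
          `${BASE}/_global_changes/_changes?feed=longpoll&since=${since}`
        );
        const body = await res.json();
        for (const change of body.results) {
          console.log('change:', change.id); // e.g. "updated:some_db"
        }
        since = body.last_seq;
      }
    }

    watchGlobalChanges().catch(console.error);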
On Tue, Dec 5, 2017 at 2:48 PM Geoffrey Cox <redge...@gmail.com> wrote:

Hi Adam, a quick follow-up: is it possible that writes can also be directed to a "primary" node, like the `stale` option for a view? I was originally thinking that the issue was with reading data via a view, but now I'm thinking it may be related to writing data and those writes somehow triggering these persistent and heavyweight couchjs processes. It's tough to say, as I'd imagine you don't have couchjs load unless you have frequent writing and reading. I'm still trying to isolate the issue, and it is difficult because the problem only seems to happen in a production environment and only with *all* the production code. Figures ;)

On Tue, Dec 5, 2017 at 12:13 PM Geoffrey Cox <redge...@gmail.com> wrote:

Hey Adam,

Attached are my local.ini and the design doc with the view JS. Please see my responses below. Thanks for the help!

On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <kocol...@apache.org> wrote:

> Hi Geoff, a couple of additional questions:
>
> 1) Are you making these view requests with stale=ok or stale=update_after?

GC: I am not using the stale parameter.

> 2) What are you using for N and Q in the [cluster] configuration settings?

GC: As per the attached local.ini, I specified n=2 and am using the default q=8.

> 3) Did you take advantage of the (barely-documented) "zones" attribute when defining cluster members?

GC: As per the attached local.ini, I have *not* specified this option.

> 4) Do you have any other JS code besides the view definitions?

GC: When you refer to JS code, I think you mean JS code "in" CouchDB, and if that's the case then my only JS code is very simple views like those in the attached view.json. (I know I really need to break out the views so that there is one view per design doc, but I haven't quite gotten around to refactoring this, and I don't believe it is causing the CPU usage.)

> Regarding #1, the cluster will actually select shards differently depending on the use of those query parameters. When your request stipulates that you're OK with stale results, the cluster *will* select a "primary" copy in order to improve the consistency of repeated requests to the same view. The algorithm for choosing those primary copies is somewhat subtle, hence my question #3.
>
> If you're not using stale requests, I have a much harder time explaining why the 100% CPU issue would migrate from node to node like that.
>
> Adam
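To make Adam's distinction concrete, the two request shapes differ only in a query parameter. A rough sketch (the database, design doc, and view names here are made up):

    const BASE = 'http://localhost:5984'; // placeholder

    // Hypothetical names, purely for illustration.
    async function queryView(params) {
      const res = await fetch(
        `${BASE}/mydb/_design/notifications/_view/by_user?${params}`
      );
      return res.json();
    }

    // Default: the coordinator uses whichever copy of each shard range
    // responds first, so repeated requests may exercise different nodes.
    queryView('limit=10').then(console.log);

    // stale=ok: the cluster prefers a deterministic "primary" copy of
    // each shard range, per Adam's explanation above. (Newer releases
    // spell this stable=true&update=false.)
    queryView('limit=10&stale=ok').then(console.log);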
On Tue, Dec 5, 2017 at 9:36 AM Geoffrey Cox <redge...@gmail.com> wrote:

Thanks for the responses, any other thoughts?

FYI: I'm trying to work on a very focused test case that I can share with the dev team, but it is taking a little while to narrow down the exact cause.

On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <rnew...@apache.org> wrote:

> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as standard. It's fast enough. It's cross-region that you need to avoid.
>
> B.

On 5 Dec 2017, at 09:11, Jan Lehnardt <j...@apache.org> wrote:

> Heya Geoff,
>
> a CouchDB cluster is designed to run in the same data center, with local-area networking latencies. A cluster across AWS Availability Zones won't work, as you are seeing. If you want CouchDBs in both AZs, use regular replication and keep the clusters local to the AZ.
>
> Best
> Jan
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
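For reference, Jan's suggestion translates into one independent cluster per AZ, tied together with persistent replications. A sketch (hostnames and the database name are placeholders; admin auth omitted):

    const AZ_A = 'http://couch-az-a.example.com:5984'; // placeholder
    const AZ_B = 'http://couch-az-b.example.com:5984'; // placeholder

    // Create a persistent, continuous replication on the AZ-a cluster
    // that pulls "mydb" from the AZ-b cluster. A mirror-image document
    // posted on the AZ-b cluster makes the setup bidirectional.
    async function setUpPull() {
      const res = await fetch(`${AZ_A}/_replicator`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          _id: 'pull-mydb-from-az-b',
          source: `${AZ_B}/mydb`,
          target: `${AZ_A}/mydb`,
          continuous: true,
        }),
      });
      console.log(await res.json());
    }

    setUpPull().catch(console.error);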
On 4. Dec 2017, at 19:46, Geoffrey Cox <redge...@gmail.com> wrote:

Hi,

I've spent days using trial and error to try to figure out why I am getting a very high CPU load on only a single node in my cluster. I'm hoping someone has an idea of what is going on, as I'm getting stuck.

Here's my configuration:

1. 2-node cluster:
   1. Each node is located in a different AWS availability zone
   2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
2. An haproxy server is load balancing traffic to the nodes using round robin

The problem:

1. After users make changes via PouchDB, a backend runs a number of routines that use views to calculate notifications. The issue is that on a single node the couchjs processes stack up and then start to consume nearly all the available CPU. This server then becomes the "workhorse" that always does *all* the heavy-duty couchjs processing until I restart this node.
2. It is important to note that both nodes have couchjs processes, but only a single node has couchjs processes that are using 100% CPU.
3. I've even resorted to setting `os_process_limit = 10`, and this just results in each couchjs process taking over 10% each! In other words, the couchjs processes eat up all the CPU no matter how many couchjs processes there are!
4. The CPU usage will eventually clear once all the processing is done, but as soon as there is more to process the workhorse node gets bogged down again.
5. If I restart the workhorse node, the other node then becomes the workhorse node. This is the only way to get the couchjs processes to "move" to another node.
6. The problem is that this design is not scalable, as only one node can be the workhorse node at any given time. Moreover, this causes specific instances to run out of CPU credits. Shouldn't the couchjs processes be spread out over all my nodes? From what I can tell, if I add more nodes I'm still going to have the issue where only one of the nodes gets bogged down. Is it possible that the problem is that I have 2 nodes and really I need at least 3? (I know a 2-node cluster is not very typical.)

Things I've checked:

1. Ensured that the load balancing is working, i.e. haproxy is indeed distributing traffic accordingly.
2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit = 5` to see if I could force more conservative usage of couchjs processes, but instead the couchjs processes just consume all the CPU anyway (see the config sketch at the end of this message).
3. I've tried simulating the issue locally with VMs and I cannot duplicate any such load. My guess is that this is because the nodes are located on the same box, so the hop distance between nodes is very small, and this somehow keeps the CPU usage to a minimum.
4. I've tried isolating the issue by creating short code snippets that intentionally spawn a lot of couchjs processes; they are spawned but don't consume 100% CPU.
5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this doesn't seem to change anything.
6. The only error entries in my CouchDB logs are like the following, and I don't believe they are related to my issue:

   [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32 <0.13974.79> 4b0b21c664 rexi_server: from: couchdb@172.31.83.32(<0.20638.79>) mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]

Does CouchDB have some logic built in that spawns a number of couchjs processes on a "primary" node? Will future view processing then always be routed to this "primary" node?

Is there a way to better distribute these heavy-duty couchjs processes? Is it possible to limit their CPU consumption? (I'm hesitant to start down the path of using something like cpulimit, as I think there is a root problem that needs to be addressed.)

I'm running out of ideas and hope that someone has some notion of what is causing this bizarre load, or whether there is a bug in CouchDB.

Thank you for any help you can provide!

Geoff
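For completeness, the two process-limit settings mentioned in the list above live under [query_server_config] in local.ini; these are the values from my experiment:

    [query_server_config]
    ; hard cap on concurrently running couchjs processes
    os_process_limit = 10
    ; above this count, idle couchjs processes are reaped
    os_process_soft_limit = 5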