Well, I'm back at this. Here is the latest info; I think it may be related to writes and the _global_changes database:
1. I run my production env and one of my nodes becomes the "workhorse" node with 100% CPU.
2. I stop all my production code from generating any more CouchDB requests and eventually the workhorse node goes back to 0% CPU.
3. I can then issue writes on a single database (really any database and ANY node--not just the workhorse node) and the workhorse node will kick back up to 100% CPU. If I stop the writes, the workhorse node will return to 0% CPU.
4. And now the punch line: if I delete the _global_changes database, the CPU drops down to 0% even if I am issuing writes! Pure cray cray.

Any thoughts? (Sorry, still working on a reproducible env for everyone)
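
For anyone who wants to poke at this in the meantime, a minimal sketch of the checks I'm describing above (host, port and admin credentials are placeholders for your own cluster):

    # watch what the cluster is busy with while the writes are running
    curl -s http://admin:password@localhost:5984/_active_tasks

    # step 4: confirm _global_changes exists, then delete it
    curl -s http://admin:password@localhost:5984/_global_changes
    curl -s -X DELETE http://admin:password@localhost:5984/_global_changes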

On Wed, Dec 6, 2017 at 6:56 AM Geoffrey Cox <redge...@gmail.com> wrote:

> Interesting, I read somewhere that having a view per ddoc is more efficient. Thanks for clarifying!
>
> On Wed, Dec 6, 2017 at 1:31 AM Jan Lehnardt <j...@apache.org> wrote:
>
>> > On 5. Dec 2017, at 21:13, Geoffrey Cox <redge...@gmail.com> wrote:
>> >
>> > Hey Adam,
>> >
>> > Attached is my local.ini and the design doc with the view JS.
>> >
>> > Please see my responses below:
>> >
>> > Thanks for the help!
>> >
>> > On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <kocol...@apache.org> wrote:
>> >
>> > Hi Geoff, a couple of additional questions:
>> >
>> > 1) Are you making these view requests with stale=ok or stale=update_after?
>> > GC: I am not using the stale parameter.
>> >
>> > 2) What are you using for N and Q in the [cluster] configuration settings?
>> > GC: As per the attached local.ini, I specified n=2 and am using the default q=8.
>> >
>> > 3) Did you take advantage of the (barely-documented) "zones" attribute when defining cluster members?
>> > GC: As per the attached local.ini, I have *not* specified this option.
>> >
>> > 4) Do you have any other JS code besides the view definitions?
>> > GC: When you refer to JS code, I think you mean JS code "in" CouchDB, and if that is the case then my only JS code is very simple views like those in the attached views.json. (I know that I really need to break out the views so that there is one view per ddoc, but I haven't quite gotten around to refactoring this, and I don't believe this is causing the CPU usage.)
>>
>> Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off, and neither option is always correct. But generally, I would recommend grouping all the views an app would need into a single ddoc.
>>
>> For each ddoc, all docs in a database have to be serialised and shipped to couchjs, and the results are shipped back; that is the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that more efficient.
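>>
>> Purely as an illustration (made-up ddoc and view names, not the views.json from this thread), "grouped" just means one ddoc carrying several named views, so a single couchjs pass per ddoc indexes them all:
>>
>>     {
>>       "_id": "_design/app",
>>       "views": {
>>         "by_type":  { "map": "function (doc) { emit(doc.type, null); }" },
>>         "by_owner": { "map": "function (doc) { emit(doc.owner, null); }" }
>>       }
>>     }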
>>
>> > Regarding #1, the cluster will actually select shards differently depending on the use of those query parameters. When your request stipulates that you're OK with stale results, the cluster *will* select a "primary" copy in order to improve the consistency of repeated requests to the same view. The algorithm for choosing those primary copies is somewhat subtle, hence my question #3.
>> >
>> > If you're not using stale requests, I have a much harder time explaining why the 100% CPU issue would migrate from node to node like that.
>> >
>> > Adam
>> >
>> > > On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <redge...@gmail.com> wrote:
>> > >
>> > > Thanks for the responses, any other thoughts?
>> > >
>> > > FYI: I'm trying to work on a very focused test case that I can share with the dev team, but it is taking a little while to narrow down the exact cause.
>> > >
>> > > On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <rnew...@apache.org> wrote:
>> > >
>> > >> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as standard. It's fast enough. It's cross-region that you need to avoid.
>> > >>
>> > >> B.
>> > >>
>> > >>> On 5 Dec 2017, at 09:11, Jan Lehnardt <j...@apache.org> wrote:
>> > >>>
>> > >>> Heya Geoff,
>> > >>>
>> > >>> a CouchDB cluster is designed to run in the same data center / with local-area networking latencies. A cluster across AWS Availability Zones won't work, as you are seeing. If you want CouchDBs in both AZs, use regular replication and keep the clusters local to each AZ.
>> > >>>
>> > >>> Best
>> > >>> Jan
>> > >>> --
>> > >>>
>> > >>>> On 4. Dec 2017, at 19:46, Geoffrey Cox <redge...@gmail.com> wrote:
>> > >>>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> I've spent days using trial and error to try and figure out why I am getting a very high CPU load on only a single node in my cluster. I'm hoping someone has an idea of what is going on, as I'm getting stuck.
>> > >>>>
>> > >>>> Here's my configuration:
>> > >>>>
>> > >>>> 1. 2-node cluster:
>> > >>>>    1. Each node is located in a different AWS availability zone
>> > >>>>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB Mem)
>> > >>>> 2. An HAProxy server is load balancing traffic to the nodes using round robin
>> > >>>>
>> > >>>> The problem:
>> > >>>>
>> > >>>> 1. After users make changes via PouchDB, a backend runs a number of routines that use views to calculate notifications. The issue is that on a single node, the couchjs processes stack up and then start to consume nearly all the available CPU. This server then becomes the "workhorse" that always does *all* the heavy-duty couchjs processing until I restart this node.
>> > >>>> 2. It is important to note that both nodes have couchjs processes, but only a single node has the couchjs processes that are using 100% CPU.
>> > >>>> 3. I've even resorted to setting `os_process_limit = 10` and this just results in each couchjs process taking over 10% each! In other words, the couchjs processes just eat up all the CPU no matter how many couchjs processes there are!
>> > >>>> 4. The CPU usage will eventually clear after all the processing is done, but then as soon as there is more to process the workhorse node will get bogged down again.
>> > >>>> 5. If I restart the workhorse node, the other node then becomes the workhorse node. This is the only way to get the couchjs processes to "move" to another node.
>> > >>>> 6. The problem is that this design is not scalable, as only one node can be the workhorse node at any given time. Moreover, this causes specific instances to run out of CPU credits. Shouldn't the couchjs processes be spread out over all my nodes? From what I can tell, if I add more nodes I'm still going to have the issue where only one of the nodes is getting bogged down. Is it possible that the problem is that I have 2 nodes and really I need at least 3 nodes? (I know a 2-node cluster is not very typical)
>> > >>>>
>> > >>>> Things I've checked:
>> > >>>>
>> > >>>> 1. Ensured that the load balancing is working, i.e. HAProxy is indeed distributing traffic accordingly.
>> > >>>> 2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit = 5` to see if I could force a more conservative usage of couchjs processes, but instead the couchjs processes just consume all the CPU load.
>> > >>>> 3. I've tried simulating the issue locally with VMs and I cannot duplicate any such load. My guess is that this is because the nodes are located on the same box, so the hop distance between nodes is very small and this somehow keeps the CPU usage to a minimum.
>> > >>>> 4. I've tried isolating the issue by creating short code snippets that intentionally try to spawn a lot of couchjs processes; they are spawned but don't consume 100% CPU.
>> > >>>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this doesn't seem to change anything.
>> > >>>> 6. The only error entries in my CouchDB logs are like the following, and I don't believe they are related to my issue:
>> > >>>>    1. [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32 <0.13974.79> 4b0b21c664 rexi_server: from: couchdb@172.31.83.32 (<0.20638.79>) mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>> > >>>>
>> > >>>> Does CouchDB have some logic built in that spawns a number of couchjs processes on a "primary" node? Will future view processing then always be routed to this "primary" node?
>> > >>>>
>> > >>>> Is there a way to better distribute these heavy-duty couchjs processes? Is it possible to limit their CPU consumption? (I'm hesitant to start down the path of using something like cpulimit as I think there is a root problem that needs to be addressed)
>> > >>>>
>> > >>>> I'm running out of ideas and hope that someone has some notion of what is causing this bizarre load or if there is a bug in CouchDB.
>> > >>>>
>> > >>>> Thank you for any help you can provide!
>> > >>>>
>> > >>>> Geoff
>> > >>>
>> > >>> --
>> > >>> Professional Support for Apache CouchDB:
>> > >>> https://neighbourhood.ie/couchdb-support/
>> > >>>
>> > >>
>> > >
>> > <views.json>
>>
>> --
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
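
PS, for anyone reading along without the attachments: the settings discussed above live in local.ini, roughly like this (illustrative sketch, not the actual attachment; the os_process_* values are the experimental ones mentioned in the thread, and everything not shown is left at its default):

    [cluster]
    n = 2
    q = 8

    [query_server_config]
    os_process_limit = 10
    os_process_soft_limit = 5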