Well, I'm back at this. Here is the latest info; I think it may be related to writes and the _global_changes database:
1. I run my production env and one of my nodes becomes the "workhorse" node with 100% CPU.
2. I stop all my production code from generating any more CouchDB requests and eventually the workhorse node goes back to 0% CPU.
3. I can then issue writes on a single database (really any database and ANY node--not just the workhorse node) and the workhorse node will kick back up to 100% CPU. If I stop the writes, the workhorse node will return to 0% CPU.
4. And now the punch line: if I delete the _global_changes database, the CPU drops down to 0% even if I am issuing writes! Pure cray cray.

Any thoughts? (Sorry, still working on a reproducible env for everyone)
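
For anyone who wants to poke at this in the meantime, a minimal sketch of the checks I'm describing above (host, port and admin credentials are placeholders for your own cluster):

    # watch what the cluster is busy with while the writes are running
    curl -s http://admin:password@localhost:5984/_active_tasks

    # step 4: confirm _global_changes exists, then delete it
    curl -s http://admin:password@localhost:5984/_global_changes
    curl -s -X DELETE http://admin:password@localhost:5984/_global_changes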

On Wed, Dec 6, 2017 at 6:56 AM Geoffrey Cox <redge...@gmail.com> wrote:

> Interesting, I read somewhere that having a view per ddoc is more efficient. Thanks for clarifying!
>
> On Wed, Dec 6, 2017 at 1:31 AM Jan Lehnardt <j...@apache.org> wrote:
>
>> > On 5. Dec 2017, at 21:13, Geoffrey Cox <redge...@gmail.com> wrote:
>> >
>> > Hey Adam,
>> >
>> > Attached is my local.ini and the design doc with the view JS.
>> >
>> > Please see my responses below:
>> >
>> > Thanks for the help!
>> >
>> > On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <kocol...@apache.org> wrote:
>> >
>> > Hi Geoff, a couple of additional questions:
>> >
>> > 1) Are you making these view requests with stale=ok or stale=update_after?
>> > GC: I am not using the stale parameter.
>> >
>> > 2) What are you using for N and Q in the [cluster] configuration settings?
>> > GC: As per the attached local.ini, I specified n=2 and am using the default q=8.
>> >
>> > 3) Did you take advantage of the (barely-documented) "zones" attribute when defining cluster members?
>> > GC: As per the attached local.ini, I have *not* specified this option.
>> >
>> > 4) Do you have any other JS code besides the view definitions?
>> > GC: When you refer to JS code, I think you mean JS code "in" CouchDB, and if that is the case then my only JS code is very simple views like those in the attached views.json. (I know that I really need to break out the views so that there is one view per ddoc, but I haven't quite gotten around to refactoring this, and I don't believe this is causing the CPU usage.)
>>
>> Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off, and neither option is always correct. But generally, I would recommend grouping all the views an app would need into a single ddoc.
>>
>> For each ddoc, all docs in a database have to be serialised and shipped to couchjs, and the results are shipped back; that is the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that more efficient.
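>>
>> Purely as an illustration (made-up ddoc and view names, not the views.json from this thread), "grouped" just means one ddoc carrying several named views, so a single couchjs pass per ddoc indexes them all:
>>
>>     {
>>       "_id": "_design/app",
>>       "views": {
>>         "by_type":  { "map": "function (doc) { emit(doc.type, null); }" },
>>         "by_owner": { "map": "function (doc) { emit(doc.owner, null); }" }
>>       }
>>     }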
>>
>> > Regarding #1, the cluster will actually select shards differently depending on the use of those query parameters. When your request stipulates that you're OK with stale results, the cluster *will* select a "primary" copy in order to improve the consistency of repeated requests to the same view. The algorithm for choosing those primary copies is somewhat subtle, hence my question #3.
>> >
>> > If you're not using stale requests, I have a much harder time explaining why the 100% CPU issue would migrate from node to node like that.
>> >
>> > Adam
>> >
>> > > On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <redge...@gmail.com> wrote:
>> > >
>> > > Thanks for the responses, any other thoughts?
>> > >
>> > > FYI: I'm trying to work on a very focused test case that I can share with the dev team, but it is taking a little while to narrow down the exact cause.
>> > >
>> > > On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <rnew...@apache.org> wrote:
>> > >
>> > >> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as standard. It's fast enough. It's cross-region that you need to avoid.
>> > >>
>> > >> B.
>> > >>
>> > >>> On 5 Dec 2017, at 09:11, Jan Lehnardt <j...@apache.org> wrote:
>> > >>>
>> > >>> Heya Geoff,
>> > >>>
>> > >>> a CouchDB cluster is designed to run in the same data center / with local-area networking latencies. A cluster across AWS Availability Zones won't work, as you are seeing. If you want CouchDBs in both AZs, use regular replication and keep the clusters local to each AZ.
>> > >>>
>> > >>> Best
>> > >>> Jan
>> > >>> --
>> > >>>
>> > >>>> On 4. Dec 2017, at 19:46, Geoffrey Cox <redge...@gmail.com> wrote:
>> > >>>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> I've spent days using trial and error to try and figure out why I am getting a very high CPU load on only a single node in my cluster. I'm hoping someone has an idea of what is going on, as I'm getting stuck.
>> > >>>>
>> > >>>> Here's my configuration:
>> > >>>>
>> > >>>> 1. 2-node cluster:
>> > >>>>    1. Each node is located in a different AWS availability zone
>> > >>>>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB Mem)
>> > >>>> 2. An HAProxy server is load balancing traffic to the nodes using round robin
>> > >>>>
>> > >>>> The problem:
>> > >>>>
>> > >>>> 1. After users make changes via PouchDB, a backend runs a number of routines that use views to calculate notifications. The issue is that on a single node, the couchjs processes stack up and then start to consume nearly all the available CPU. This server then becomes the "workhorse" that always does *all* the heavy-duty couchjs processing until I restart this node.
>> > >>>> 2. It is important to note that both nodes have couchjs processes, but only a single node has the couchjs processes that are using 100% CPU.
>> > >>>> 3. I've even resorted to setting `os_process_limit = 10` and this just results in each couchjs process taking over 10% each! In other words, the couchjs processes just eat up all the CPU no matter how many couchjs processes there are!
>> > >>>> 4. The CPU usage will eventually clear after all the processing is done, but then as soon as there is more to process the workhorse node will get bogged down again.
>> > >>>> 5. If I restart the workhorse node, the other node then becomes the workhorse node. This is the only way to get the couchjs processes to "move" to another node.
>> > >>>> 6. The problem is that this design is not scalable, as only one node can be the workhorse node at any given time. Moreover, this causes specific instances to run out of CPU credits. Shouldn't the couchjs processes be spread out over all my nodes? From what I can tell, if I add more nodes I'm still going to have the issue where only one of the nodes is getting bogged down. Is it possible that the problem is that I have 2 nodes and really I need at least 3 nodes? (I know a 2-node cluster is not very typical)
>> > >>>>
>> > >>>> Things I've checked:
>> > >>>>
>> > >>>> 1. Ensured that the load balancing is working, i.e. HAProxy is indeed distributing traffic accordingly.
>> > >>>> 2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit = 5` to see if I could force a more conservative usage of couchjs processes, but instead the couchjs processes just consume all the CPU load.
>> > >>>> 3. I've tried simulating the issue locally with VMs and I cannot duplicate any such load. My guess is that this is because the nodes are located on the same box, so the hop distance between nodes is very small and this somehow keeps the CPU usage to a minimum.
>> > >>>> 4. I've tried isolating the issue by creating short code snippets that intentionally try to spawn a lot of couchjs processes; they are spawned but don't consume 100% CPU.
>> > >>>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this doesn't seem to change anything.
>> > >>>> 6. The only error entries in my CouchDB logs are like the following, and I don't believe they are related to my issue:
>> > >>>>    1. [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32 <0.13974.79> 4b0b21c664 rexi_server: from: couchdb@172.31.83.32 (<0.20638.79>) mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>> > >>>>
>> > >>>> Does CouchDB have some logic built in that spawns a number of couchjs processes on a "primary" node? Will future view processing then always be routed to this "primary" node?
>> > >>>>
>> > >>>> Is there a way to better distribute these heavy-duty couchjs processes? Is it possible to limit their CPU consumption? (I'm hesitant to start down the path of using something like cpulimit as I think there is a root problem that needs to be addressed)
>> > >>>>
>> > >>>> I'm running out of ideas and hope that someone has some notion of what is causing this bizarre load or if there is a bug in CouchDB.
>> > >>>>
>> > >>>> Thank you for any help you can provide!
>> > >>>>
>> > >>>> Geoff
>> > >>>
>> > >>> --
>> > >>> Professional Support for Apache CouchDB:
>> > >>> https://neighbourhood.ie/couchdb-support/
>> > >>>
>> > >>
>> > >
>> > <views.json>
>>
>> --
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
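
PS, for anyone reading along without the attachments: the settings discussed above live in local.ini, roughly like this (illustrative sketch, not the actual attachment; the os_process_* values are the experimental ones mentioned in the thread, and everything not shown is left at its default):

    [cluster]
    n = 2
    q = 8

    [query_server_config]
    os_process_limit = 10
    os_process_soft_limit = 5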