Hi, I've spent days using trial and error to try to figure out why I am getting a very high CPU load on only a single node in my cluster. I'm hoping someone has an idea of what is going on, as I'm stuck.
Here's my configuration:

1. 2-node cluster:
   1. Each node is located in a different AWS availability zone
   2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
2. An haproxy server load balances traffic to the nodes using round robin

The problem:

1. After users make changes via PouchDB, a backend runs a number of routines that use views to calculate notifications. On a single node, the couchjs processes stack up and start to consume nearly all the available CPU. This server then becomes the "workhorse" that always does *all* the heavy-duty couchjs processing until I restart the node.
2. It is important to note that both nodes have couchjs processes, but only one node has couchjs processes that are using 100% CPU.
3. I've even resorted to setting `os_process_limit = 10`, and this just results in each couchjs process taking over 10% CPU! In other words, the couchjs processes eat up all the CPU no matter how many couchjs processes there are.
4. The CPU usage eventually clears after all the processing is done, but as soon as there is more to process, the workhorse node gets bogged down again.
5. If I restart the workhorse node, the other node becomes the workhorse node. This is the only way to get the couchjs processes to "move" to another node.
6. This design is not scalable, as only one node can be the workhorse node at any given time. Moreover, it causes specific instances to run out of CPU credits.

Shouldn't the couchjs processes be spread out over all my nodes? From what I can tell, if I add more nodes I'm still going to have the issue where only one of the nodes gets bogged down. Is it possible that the problem is that I have 2 nodes and I really need at least 3? (I know a 2-node cluster is not very typical.)

Things I've checked:

1. I've ensured that the load balancing is working, i.e. haproxy is indeed distributing traffic across both nodes.
2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit = 5` to see if I could force more conservative usage of couchjs processes, but the couchjs processes still consume all the CPU.
3. I've tried simulating the issue locally with VMs and I cannot reproduce any such load. My guess is that this is because the nodes are located on the same box, so the hop distance between nodes is very small and this somehow keeps the CPU usage to a minimum.
4. I've tried isolating the issue by creating short code snippets that intentionally try to spawn a lot of couchjs processes. They are spawned but don't consume 100% CPU.
5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0, and this doesn't seem to change anything.
6. The only error entries in my CouchDB logs are like the following, and I don't believe they are related to my issue:
   1. [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32 <0.13974.79> 4b0b21c664 rexi_server: from: couchdb@172.31.83.32(<0.20638.79>) mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]

Does CouchDB have some logic built in that spawns a number of couchjs processes on a "primary" node? Will future view processing then always be routed to this "primary" node? Is there a way to better distribute these heavy-duty couchjs processes? Is it possible to limit their CPU consumption? (I'm hesitant to start down the path of using something like cpulimit, as I think there is a root problem that needs to be addressed.)

I'm running out of ideas and hope that someone has some notion of what is causing this bizarre load, or whether there is a bug in CouchDB.

Thank you for any help you can provide!

Geoff
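P.S. In case it helps anyone compare the two nodes, here is a minimal sketch of one way to tally couchjs CPU usage per node. It assumes the output of `ps -eo pid,pcpu,comm` has been captured on each node; the function name `couchjs_cpu_by_pid` and the sample values are hypothetical and made up for illustration, not real measurements from my cluster.

```python
from typing import Dict


def couchjs_cpu_by_pid(ps_output: str) -> Dict[int, float]:
    """Map PID -> %CPU for every couchjs process in `ps -eo pid,pcpu,comm` output."""
    usage: Dict[int, float] = {}
    # The first line of ps output is the column header, so skip it.
    for line in ps_output.strip().splitlines()[1:]:
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        pid, pcpu, comm = parts
        if comm.strip() == "couchjs":
            usage[int(pid)] = float(pcpu)
    return usage


# Hypothetical sample of what one node might report (illustrative values only).
SAMPLE = """\
  PID %CPU COMMAND
 1201  0.3 beam.smp
 1345 19.8 couchjs
 1346 20.1 couchjs
 1350  0.0 haproxy
"""

if __name__ == "__main__":
    usage = couchjs_cpu_by_pid(SAMPLE)
    total = round(sum(usage.values()), 1)
    print(f"{len(usage)} couchjs processes, total %CPU: {total}")
```

Running the same tally on both nodes at the same moment should make it obvious which one is currently the workhorse and how the load is split across its couchjs processes.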