Hi,

I've spent days using trial and error to try to figure out why I am
getting a very high CPU load on only one node in my cluster. I'm
hoping someone has an idea of what is going on, as I'm stuck.

Here's my configuration:

   1. 2-node cluster:
      1. Each node is located in a different AWS availability zone
      2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
   2. A haproxy server is load balancing traffic to the nodes using round
   robin

The problem:

   1. After users make changes via PouchDB, a backend runs a number of
   routines that use views to calculate notifications. The issue is that on a
   single node, the couchjs processes stack up and then start to consume
   nearly all the available CPU. This server then becomes the "workhorse" that
   always does *all* the heavy duty couchjs processing until I restart this
   node.
   2. It is important to note that both nodes have couchjs processes, but
   it is only a single node that has the couchjs processes that are using 100%
   CPU
   3. I've even resorted to setting `os_process_limit = 10`, and this just
   results in each couchjs process using over 10% CPU! In other words, the
   couchjs processes eat up all the CPU no matter how many couchjs
   processes there are!
   4. The CPU usage will eventually clear after all the processing is done,
   but then as soon as there is more to process the workhorse node will get
   bogged down again.
   5. If I restart the workhorse node, the other node then becomes the
   workhorse node. This is the only way to get the couchjs processes to "move"
   to another node.
   6. The problem is that this design is not scalable as only one node can
   be the workhorse node at any given time. Moreover this causes specific
   instances to run out of CPU credits. Shouldn't the couchjs processes be
   spread out over all my nodes? From what I can tell, if I add more nodes I'm
   still going to have the issue where only one of the nodes is getting bogged
   down. Is it possible that the problem is that I have 2 nodes and really I
   need at least 3 nodes? (I know a 2-node cluster is not very typical)
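
One way to confirm which node is doing the view indexing at a given moment is to poll `_active_tasks` and group the indexer tasks by node. The following is a minimal sketch, not a definitive diagnostic: it assumes each task entry carries a `type` and a `node` field, `base_url` and the sample node names are placeholders, and authentication is left out. It counts indexing jobs, not couchjs processes directly, so it only approximates where the couchjs load lands.

```python
import json
from collections import Counter
from urllib.request import urlopen


def indexers_by_node(tasks):
    """Count view-indexer tasks per node in an _active_tasks response."""
    return Counter(
        task.get("node", "unknown")
        for task in tasks
        if task.get("type") == "indexer"
    )


def fetch_active_tasks(base_url):
    """Fetch the task list from a node (base_url is a placeholder).

    _active_tasks normally requires admin credentials, which this
    sketch does not supply.
    """
    with urlopen(base_url + "/_active_tasks") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real response.
    sample = [
        {"type": "indexer", "node": "couchdb@node-a"},
        {"type": "indexer", "node": "couchdb@node-a"},
        {"type": "replication", "node": "couchdb@node-b"},
    ]
    for node, count in indexers_by_node(sample).items():
        print(node, count)
```

Running this periodically while the backend routines are active would show whether indexer tasks are reported for both nodes or only for the workhorse node.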


Things I've checked:

   1. Ensured that the load balancing is working, i.e. haproxy is indeed
   distributing traffic accordingly
   2. I've tried setting `os_process_limit = 10` and
   `os_process_soft_limit = 5` to see if I could force a more conservative
   usage of couchjs processes, but the couchjs processes still consume all
   the available CPU.
   3. I've tried simulating the issue locally with VMs and I cannot
   duplicate any such load. My guess is that this is because the nodes are
   located on the same box so hop distance between nodes is very small and
   this somehow keeps the CPU usage to a minimum
   4. I've tried isolating the issue by creating short code snippets that
   intentionally try to spawn a lot of couchjs processes and they are spawned
   but don't consume 100% CPU
   5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
   doesn't seem to change anything
   6. The only error entries in my CouchDB logs are like the following and
   I don't believe they are related to my issue:

      [error] 2017-12-04T18:13:38.728970Z couchdb@172.31.83.32 <0.13974.79>
      4b0b21c664 rexi_server: from: couchdb@172.31.83.32(<0.20638.79>) mfa:
      fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access
      this db.">>}
      [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]

Does CouchDB have some logic built in that spawns a number of couchjs
processes on a "primary" node? Will future view processing then always be
routed to this "primary" node?

Is there a way to better distribute these heavy duty couchjs processes? Is
it possible to limit their CPU consumption? (I'm hesitant to start down the
path of using something like cpulimit as I think there is a root problem
that needs to be addressed)

I'm running out of ideas and hope that someone has some notion of what is
causing this bizarre load, or whether this is a bug in CouchDB.

Thank you for any help you can provide!

Geoff
