Adam, thanks for your fast answer. I only posted the metrics that changed significantly before/after the crash; that was not the case for the message queues and process counts.
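In case it helps, those fields can be pulled straight out of the _system output with something along these lines (a rough sketch only, assuming curl and jq are available and the node answers on localhost:5984 — the credentials and port are placeholders to adapt):

  curl -s http://admin:password@localhost:5984/_node/_local/_system \
    | jq '{os_proc_count, stale_proc_count, process_count, process_limit, message_queues}'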
"os_proc_count":2, "stale_proc_count":0, "process_count":1520, "process_limit":262144, "message_queues":{ "couch_file":{ "count":541, "min":0, "max":0, "50":0, "90":0, "99":0 }, "couch_db_updater":{ "count":479, "min":0, "max":0, "50":0, "90":0, "99":0 }, "folsom_sample_slide_sup":0, "chttpd_auth_cache":0, "global_changes_sup":0, "couch_drv":0, "chttpd":0, ... ... (all other values are 0) Thanks for clarifying that context_switches, reductions, etc. are simple counters (I was wondering if they were *totals* or *per hour/day*). I also tried to figure out *which* CouchDB part consumes this amount of memory. Before crash, when RAM was saturated: - CouchDB had only 6 to 10 processes running (according to `ps -u couchdb`), including `epmd`, `beam.smp`, `erl_child_setup`, `couchjs`... - The extra RAM usage is done by `beam.smp` (according to `ps -o rss $PID_OF_BEAM`), see in COUCHDB_RAM in https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png Said differently, during spikes `beam.smp` is using 95% of total memory. Vladimir, thanks for your feedback! I'll definitely try to measure the mailbox size, using your commands, next time CouchDB eats up all RAM (it should be in a few weeks). However it shouldn't be related to a node being down, because it happens on 1-node setups. Adrien Le ven. 14 juin 2019 à 15:53, Adam Kocoloski <a...@kocolosk.net> a écrit : > Hi Adrien, > > Hi Adrien, there are some additional metrics in the _system output that > you omitted regarding message queue lengths and process counts. Did you see > any significant difference in those? > > The reason I’m asking is to try and figure out whether a small set of > known processes within the Erlang VM are consuming a lot of memory > (possibly because they have large message backlogs), or whether you might > have a large number of processes hanging around and never getting cleaned > up. > > Aside from the memory numbers, most of the other metrics you pointed out > (context_switches, reductions, etc.) are simple counters and so they’re > only really useful when you look at their derivative. > > Adam > > > On Jun 14, 2019, at 9:24 AM, Adrien Vergé <adrien.ve...@tolteck.com> > wrote: > > > > Hi Jérôme and Adam, > > > > That's funny, because I'm investigating the exact same problem these > days. > > We have a two CouchDB setups: > > - a one-node server (q=2 n=1) with 5000 databases > > - a 3-node cluster (q=2 n=3) with 50000 databases > > > > ... and we are experiencing the problem on both setups. We've been having > > this problem for at least 3-4 months. > > > > We've monitored: > > > > - The number of open files: it's relatively low (both the system's total > > and or fds opened by beam.smp). > > https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png > > > > - The usage of RAM, total used and used by beam.smp > > https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png > > It continuously grows, with regular spikes, until killing CouchDB with > an > > OOM. After restart, the RAM usage is nice and low, and no spikes. > > > > - /_node/_local/_system metrics, before and after restart. 
Adrien

On Fri, Jun 14, 2019 at 3:53 PM, Adam Kocoloski <a...@kocolosk.net> wrote:

> Hi Adrien,
>
> There are some additional metrics in the _system output that you omitted
> regarding message queue lengths and process counts. Did you see any
> significant difference in those?
>
> The reason I’m asking is to try and figure out whether a small set of
> known processes within the Erlang VM are consuming a lot of memory
> (possibly because they have large message backlogs), or whether you might
> have a large number of processes hanging around and never getting cleaned
> up.
>
> Aside from the memory numbers, most of the other metrics you pointed out
> (context_switches, reductions, etc.) are simple counters, so they’re only
> really useful when you look at their derivative.
>
> Adam
>
> > On Jun 14, 2019, at 9:24 AM, Adrien Vergé <adrien.ve...@tolteck.com> wrote:
> >
> > Hi Jérôme and Adam,
> >
> > That's funny, because I'm investigating the exact same problem these days.
> > We have two CouchDB setups:
> > - a one-node server (q=2 n=1) with 5000 databases
> > - a 3-node cluster (q=2 n=3) with 50000 databases
> >
> > ... and we are experiencing the problem on both setups. We've been having
> > this problem for at least 3-4 months.
> >
> > We've monitored:
> >
> > - The number of open files: it's relatively low (both the system's total
> >   and the fds opened by beam.smp).
> >   https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> >
> > - The usage of RAM, both total and used by beam.smp:
> >   https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> >   It continuously grows, with regular spikes, until CouchDB is killed by
> >   the OOM killer. After a restart, the RAM usage is nice and low, with no
> >   spikes.
> >
> > - The /_node/_local/_system metrics, before and after restart. Values that
> >   significantly differ (before / after restart) are listed here:
> >   - uptime (obviously ;-))
> >   - memory.processes: +3732 %
> >   - memory.processes_used: +3735 %
> >   - memory.binary: +17700 %
> >   - context_switches: +17376 %
> >   - reductions: +867832 %
> >   - garbage_collection_count: +448248 %
> >   - words_reclaimed: +112755 %
> >   - io_input: +44226 %
> >   - io_output: +157951 %
> >
> > Before CouchDB restart:
> > {
> >   "uptime": 2712973,
> >   "memory": {
> >     "other": 7250289,
> >     "atom": 512625,
> >     "atom_used": 510002,
> >     "processes": 1877591424,
> >     "processes_used": 1877504920,
> >     "binary": 177468848,
> >     "code": 9653286,
> >     "ets": 16012736
> >   },
> >   "run_queue": 0,
> >   "ets_table_count": 102,
> >   "context_switches": 1621495509,
> >   "reductions": 968705947589,
> >   "garbage_collection_count": 331826928,
> >   "words_reclaimed": 269964293572,
> >   "io_input": 8812455,
> >   "io_output": 20733066,
> >   ...
> >
> > After CouchDB restart:
> > {
> >   "uptime": 206,
> >   "memory": {
> >     "other": 6907493,
> >     "atom": 512625,
> >     "atom_used": 497769,
> >     "processes": 49001944,
> >     "processes_used": 48963168,
> >     "binary": 997032,
> >     "code": 9233842,
> >     "ets": 4779576
> >   },
> >   "run_queue": 0,
> >   "ets_table_count": 102,
> >   "context_switches": 1015486,
> >   "reductions": 111610788,
> >   "garbage_collection_count": 74011,
> >   "words_reclaimed": 239214127,
> >   "io_input": 19881,
> >   "io_output": 13118,
> >   ...
> >
> > Adrien
> >
> > On Fri, Jun 14, 2019 at 3:11 PM, Jérôme Augé <jerome.a...@anakeen.com> wrote:
> >
> >> Ok, so I'll set up a cron job to journalize (every minute?) the output
> >> from "/_node/_local/_system" and wait for the next OOM kill.
> >>
> >> Any property from "_system" to look for in particular?
> >>
> >> Here is a link to the memory usage graph:
> >> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> >>
> >> The memory usage varies, but the general trend is to go up with some
> >> regularity over a week until we reach OOM. When "beam.smp" is killed,
> >> it's reported as consuming 15 GB (as seen in the kernel's OOM trace in
> >> syslog).
> >>
> >> Thanks,
> >> Jérôme
> >>
> >> On Fri, Jun 14, 2019 at 1:48 PM, Adam Kocoloski <kocol...@apache.org> wrote:
> >>
> >>> Hi Jérôme,
> >>>
> >>> Thanks for a well-written and detailed report (though the mailing list
> >>> strips attachments). The _system endpoint provides a lot of useful data
> >>> for debugging these kinds of situations; do you have a snapshot of the
> >>> output when the system was consuming a lot of memory?
> >>>
> >>> http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> >>>
> >>> Adam
> >>>
> >>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm having a hard time figuring out the high memory usage of a CouchDB
> >>>> server.
> >>>>
> >>>> What I'm observing is that the memory consumption of the "beam.smp"
> >>>> process gradually rises until it triggers the kernel's OOM
> >>>> (Out-Of-Memory) killer, which kills the "beam.smp" process.
> >>>>
> >>>> It also seems that many databases are not compacted: I've made a script
> >>>> to iterate over the databases and compute the fragmentation factor, and
> >>>> it seems I have around 2100 databases with a frag > 70%.
> >>>>
> >>>> We have a single CouchDB v2.1.1 server (configured with q=8 n=1) and
> >>>> around 2770 databases.
> >>>>
> >>>> The server initially had 4 GB of RAM, and we are now at 16 GB with 8
> >>>> vCPUs, and it still regularly reaches OOM.
> >>>> From the monitoring, I see that with 16 GB the OOM is triggered almost
> >>>> once per week (cf. attached graph).
> >>>>
> >>>> The memory usage seems to increase gradually until it reaches OOM.
> >>>>
> >>>> The Couch server is mostly used by web clients with the PouchDB JS API.
> >>>>
> >>>> We have ~1300 distinct users, and by monitoring the established TCP
> >>>> connections (netstat) I guess we have around 100 users (maximum) at any
> >>>> given time. From what I understand of the application's logic, each
> >>>> user accesses 2 private databases (read/write) + 1 common database
> >>>> (read-only).
> >>>>
> >>>> On-disk usage of CouchDB's data directory is around 40 GB.
> >>>>
> >>>> Any ideas on what could cause such behavior (increasing memory usage
> >>>> over the course of a week)? Or how to find out what is happening behind
> >>>> the scenes?
> >>>>
> >>>> Regards,
> >>>> Jérôme
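For reference, the per-database fragmentation check mentioned in the thread can be sketched roughly like this (an illustration only, not Jérôme's actual script; it assumes curl and jq, a node on localhost:5984 with placeholder credentials, the CouchDB 2.x "sizes" fields, and frag = (file - active) / file):

  #!/bin/sh
  # Report databases whose fragmentation exceeds 70% (illustrative sketch).
  # Placeholders: admin:password and port 5984; database names containing
  # special characters (e.g. "/") would need URL-encoding.
  COUCH=http://admin:password@localhost:5984
  curl -s "$COUCH/_all_dbs" | jq -r '.[]' | while read -r db; do
    curl -s "$COUCH/$db" \
      | jq -r --arg db "$db" \
          'select(.sizes.file > 0)
           | ((.sizes.file - .sizes.active) / .sizes.file * 100)
           | select(. > 70)
           | "\($db) frag=\(round)%"'
  done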