Adam, thanks for your quick answer.

I only posted the metrics that changed significantly before/after the crash.
That was not the case for message queues and process counts, but for
reference, here are their values before the crash, when RAM is saturated:

  "os_proc_count":2,
  "stale_proc_count":0,
  "process_count":1520,
  "process_limit":262144,
  "message_queues":{
    "couch_file":{
      "count":541,
      "min":0,
      "max":0,
      "50":0,
      "90":0,
      "99":0
    },
    "couch_db_updater":{
      "count":479,
      "min":0,
      "max":0,
      "50":0,
      "90":0,
      "99":0
    },
    "folsom_sample_slide_sup":0,
    "chttpd_auth_cache":0,
    "global_changes_sup":0,
    "couch_drv":0,
    "chttpd":0,
    ...
    ... (all other values are 0)

Thanks for clarifying that context_switches, reductions, etc. are simple
counters (I was wondering if they were *totals* or *per hour/day*).
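
So next time I'll look at the rate between two snapshots rather than the raw
values, e.g. with something like this (a rough sketch, assuming jq is
installed, the node answers on localhost:5984 and no admin credentials are
required):

  URL=http://127.0.0.1:5984/_node/_local/_system
  R1=$(curl -s "$URL" | jq .reductions); sleep 60
  R2=$(curl -s "$URL" | jq .reductions)
  echo "reductions per second: $(( (R2 - R1) / 60 ))"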

I also tried to figure out *which* part of CouchDB consumes this amount of
memory. Before the crash, when RAM was saturated:
- CouchDB had only 6 to 10 processes running (according to `ps -u
couchdb`), including `epmd`, `beam.smp`, `erl_child_setup`, `couchjs`...
- The extra RAM usage comes from `beam.smp` (according to `ps -o rss
$PID_OF_BEAM`, spelled out in the short sketch below); see the COUCHDB_RAM
curve in https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
  Said differently, during spikes `beam.smp` is using 95% of total memory.
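
  For reference, that sampling boils down to something like (a sketch):

    PID_OF_BEAM=$(pgrep -o -u couchdb beam.smp)  # oldest beam.smp owned by couchdb
    ps -o rss= -p "$PID_OF_BEAM"                 # resident set size, in KiB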

Vladimir, thanks for your feedback! I'll definitely try to measure the
mailbox sizes with your commands the next time CouchDB eats up all the RAM
(that should happen in a few weeks). However, it shouldn't be related to a
node being down, because it also happens on 1-node setups.
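
In the meantime I'll keep an eye on the aggregated queue stats that _system
already exposes, with something like this (a sketch, assuming jq; URL and
credentials to adapt):

  curl -s http://127.0.0.1:5984/_node/_local/_system \
    | jq '.message_queues | {couch_file, couch_db_updater}'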

Adrien

On Fri, Jun 14, 2019 at 15:53, Adam Kocoloski <a...@kocolosk.net> wrote:

> Hi Adrien,
>
> There are some additional metrics in the _system output that
> you omitted regarding message queue lengths and process counts. Did you see
> any significant difference in those?
>
> The reason I’m asking is to try and figure out whether a small set of
> known processes within the Erlang VM are consuming a lot of memory
> (possibly because they have large message backlogs), or whether you might
> have a large number of processes hanging around and never getting cleaned
> up.
>
> Aside from the memory numbers, most of the other metrics you pointed out
> (context_switches, reductions, etc.) are simple counters and so they’re
> only really useful when you look at their derivative.
>
> Adam
>
> > On Jun 14, 2019, at 9:24 AM, Adrien Vergé <adrien.ve...@tolteck.com>
> wrote:
> >
> > Hi Jérôme and Adam,
> >
> > That's funny, because I'm investigating the exact same problem these days.
> > We have two CouchDB setups:
> > - a one-node server (q=2 n=1) with 5000 databases
> > - a 3-node cluster (q=2 n=3) with 50000 databases
> >
> > ... and we are experiencing the problem on both setups. We've been having
> > this problem for at least 3-4 months.
> >
> > We've monitored:
> >
> > - The number of open files: it's relatively low (both the system's total
> > and the fds opened by beam.smp).
> >  https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> >
> > - The usage of RAM, total used and used by beam.smp
> >  https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> >  It continuously grows, with regular spikes, until CouchDB is killed by an
> > OOM. After restart, the RAM usage is nice and low, with no spikes.
> >
> > - /_node/_local/_system metrics, before and after restart. Values that
> > significantly differ (before / after restart) are listed here:
> >  - uptime (obviously ;-))
> >  - memory.processes : + 3732 %
> >  - memory.processes_used : + 3735 %
> >  - memory.binary : + 17700 %
> >  - context_switches : + 17376 %
> >  - reductions : + 867832 %
> >  - garbage_collection_count : + 448248 %
> >  - words_reclaimed : + 112755 %
> >  - io_input : + 44226 %
> >  - io_output : + 157951 %
> >
> > Before CouchDB restart:
> > {
> >  "uptime":2712973,
> >  "memory":{
> >    "other":7250289,
> >    "atom":512625,
> >    "atom_used":510002,
> >    "processes":1877591424,
> >    "processes_used":1877504920,
> >    "binary":177468848,
> >    "code":9653286,
> >    "ets":16012736
> >  },
> >  "run_queue":0,
> >  "ets_table_count":102,
> >  "context_switches":1621495509,
> >  "reductions":968705947589,
> >  "garbage_collection_count":331826928,
> >  "words_reclaimed":269964293572,
> >  "io_input":8812455,
> >  "io_output":20733066,
> >  ...
> >
> > After CouchDB restart:
> > {
> >  "uptime":206,
> >  "memory":{
> >    "other":6907493,
> >    "atom":512625,
> >    "atom_used":497769,
> >    "processes":49001944,
> >    "processes_used":48963168,
> >    "binary":997032,
> >    "code":9233842,
> >    "ets":4779576
> >  },
> >  "run_queue":0,
> >  "ets_table_count":102,
> >  "context_switches":1015486,
> >  "reductions":111610788,
> >  "garbage_collection_count":74011,
> >  "words_reclaimed":239214127,
> >  "io_input":19881,
> >  "io_output":13118,
> >  ...
> >
> > Adrien
> >
> > On Fri, Jun 14, 2019 at 15:11, Jérôme Augé <jerome.a...@anakeen.com> wrote:
> >
> >> Ok, so I'll set up a cron job to log (every minute?) the output from
> >> "/_node/_local/_system" and wait for the next OOM kill.
> >>
> >> Any property from "_system" to look for in particular?
> >>
> >> Here is a link to the memory usage graph:
> >> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> >>
> >> The memory usage varies, but the general trend is upward, fairly regularly
> >> over a week, until we reach OOM. When "beam.smp" is killed, it's reported
> >> as consuming 15 GB (as seen in the kernel's OOM trace in syslog).
> >>
> >> Thanks,
> >> Jérôme
> >>
> >> On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski <kocol...@apache.org> wrote:
> >>
> >>> Hi Jérôme,
> >>>
> >>> Thanks for a well-written and detailed report (though the mailing list
> >>> strips attachments). The _system endpoint provides a lot of useful data
> >>> for debugging these kinds of situations; do you have a snapshot of the
> >>> output when the system was consuming a lot of memory?
> >>>
> >>> http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> >>>
> >>> Adam
> >>>
> >>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com>
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm having a hard time figuring out the high memory usage of a CouchDB
> >>>> server.
> >>>>
> >>>> What I'm observing is that the memory consumption of the "beam.smp"
> >>>> process gradually rises until it triggers the kernel's OOM
> >>>> (Out-Of-Memory) killer, which kills the "beam.smp" process.
> >>>>
> >>>> It also seems that many databases are not compacted: I've made a script
> >>>> to iterate over the databases and compute the fragmentation factor, and
> >>>> it seems I have around 2100 databases with a frag > 70%.
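> >>>>
> >>>> (Roughly along these lines, for reference -- a sketch assuming jq, the
> >>>> CouchDB 2.x sizes.active/sizes.file fields, admin credentials in the URL
> >>>> and database names that don't need URL-encoding:)
> >>>>
> >>>>   COUCH=http://admin:secret@127.0.0.1:5984
> >>>>   for db in $(curl -s "$COUCH/_all_dbs" | jq -r '.[]'); do
> >>>>     # fragmentation = share of the file not occupied by live data
> >>>>     curl -s "$COUCH/$db" | jq -r --arg db "$db" \
> >>>>       '"\($db) frag=\((1 - .sizes.active / .sizes.file) * 100 | round)%"'
> >>>>   done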
> >>>>
> >>>> We have a single CouchDB v2.1.1 server (configured with q=8 n=1) and
> >>>> around 2770 databases.
> >>>>
> >>>> The server initially had 4 GB of RAM, and we are now at 16 GB with 8
> >>>> vCPUs, and it still regularly reaches OOM. From the monitoring I see that
> >>>> with 16 GB the OOM is triggered almost once per week (cf. attached graph).
> >>>>
> >>>> The memory usage seems to increase gradually until it reaches OOM.
> >>>>
> >>>> The Couch server is mostly used by web clients with the PouchDB JS
> >>>> API.
> >>>>
> >>>> We have ~1300 distinct users, and by monitoring the established TCP
> >>>> connections with netstat I guess we have around 100 users (maximum) at
> >>>> any given time. From my understanding of the application's logic, each
> >>>> user accesses 2 private databases (read/write) + 1 common database
> >>>> (read-only).
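> >>>>
> >>>> (For the record, such a count boils down to something like the following
> >>>> sketch; 5984 is just the assumed port, i.e. whatever CouchDB or the proxy
> >>>> in front of it listens on:)
> >>>>
> >>>>   netstat -tn | awk '$4 ~ /:5984$/ && $6 == "ESTABLISHED"' | wc -l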
> >>>>
> >>>> On-disk usage of CouchDB's data directory is around 40 GB.
> >>>>
> >>>> Any ideas on what could cause such behavior (increasing memory usage
> >>>> over the course of a week)? Or how to find out what is happening behind
> >>>> the scenes?
> >>>>
> >>>> Regards,
> >>>> Jérôme
> >>>
> >>
>
>
