Hi all,

Here is more feedback, since one of our CouchDB servers crashed last night.

- Setup: CouchDB 2.3.1 on a 3-node cluster (q=2 n=3) with ~50k small
databases.

- Only one of the 3 nodes crashed. The others should crash in a few days (they
  usually crash and restart every ~3 weeks).

- update_lru_on_read = false

- The extra memory consumption comes from the beam.smp process (see the graph
  linked below).

- The crash is an OOM; see the last log lines before the restart:

      eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
      Crash dump is being written to: erl_crash.dump...
      [os_mon] memory supervisor port (memsup): Erlang has closed
      [os_mon] cpu supervisor port (cpu_sup): Erlang has closed

- Over the last few weeks, beam.smp memory usage has kept increasing. See the
  graph I made at https://framapic.org/jnHAyVEKq98k/kXCQv3pyUdz0.png

- /_node/_local/_system metrics look normal. The differences between an "about
  to crash" node and a "freshly restarted and lots of free RAM" node are in
  uptime, memory.processes_used, memory.binary, context_switches, reductions,
  garbage_collection_count, io_input... as previously discussed with Adam.

- This command gives exactly the same output on an "about to crash" node as on
  a "freshly restarted and lots of free RAM" node:

      MQSizes2 = lists:map(fun(A) ->
          {_, B} = case process_info(A, total_heap_size) of
              {XV, XB} -> {XV, XB};
              _ERR -> io:format("~p", [_ERR]), {ok, 0}
          end,
          {B, A}
      end, processes()).
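
  For reference, one way to surface the largest entries from MQSizes2 on each
  node (a minimal sketch, run in the same remsh session; the cut-off of 10 is
  arbitrary):

      %% Largest total_heap_size first (sizes are in words).
      Top10 = lists:sublist(lists:reverse(lists:sort(MQSizes2)), 10),
      %% Show the registered name (if any) to make the pids easier to identify.
      [{Sz, Pid, process_info(Pid, registered_name)} || {Sz, Pid} <- Top10].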

On Thu, Jun 20, 2019 at 17:08, Jérôme Augé <[email protected]> wrote:

> We are going to plan an upgrade from 2.1.1 to 2.3.1 in the coming weeks.
>
> I have a side question concerning CouchDB upgrades: is the database binary
> compatible between v2.1.1 and v2.3.1? In case we ever need to downgrade back
> to 2.1.1, can the binary data be kept?
>
> Regards,
> Jérôme
>
> On Wed, Jun 19, 2019 at 08:59, Jérôme Augé <[email protected]> wrote:
>
> > Thanks Adam for your explanations!
> >
> > The "update_lru_on_read" is already set to false on this instance (I had
> > already seen the comments on these pull-requests).
> >
> > We are indeed running an "old" 2.1.1 version, and we have advised the
> > client that an upgrade might be needed to sort out (or further
> > investigate) these problems.
> >
> > Thanks again,
> > Jérôme
> >
> >
> >
> > On Tue, Jun 18, 2019 at 18:59, Adam Kocoloski <[email protected]> wrote:
> >
> >> Hi Jérôme, definitely useful.
> >>
> >> The “run_queue” is the number of Erlang processes in a runnable state
> >> that are not currently executing on a scheduler. When that value is
> greater
> >> than zero it means the node is hitting some compute limitations. Seeing
> a
> >> small positive value from time to time is no problem.
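> >>
> >> If you want to sample it outside of _system, a minimal sketch from a remsh:
> >>
> >> erlang:statistics(run_queue).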
> >>
> >> Your last six snapshots show a message queue backlog in couch_server.
> >> That could be what caused the node to OOM. The couch_server process is a
> >> singleton and if it accumulates a large message backlog there are
> limited
> >> backpressure or scaling mechanisms to help it recover. I noticed you’re
> >> running 2.1.1; there were a couple of important enhancements to reduce
> the
> >> message flow through couch_server in more recent releases:
> >>
> >> 2.2.0: https://github.com/apache/couchdb/pull/1118 <
> >> https://github.com/apache/couchdb/pull/1118>
> >> 2.3.1: https://github.com/apache/couchdb/pull/1593 <
> >> https://github.com/apache/couchdb/pull/1593>
> >>
> >> The change in 2.2.0 is just a change in the default configuration; you
> >> can try applying it to your server by setting:
> >>
> >> [couchdb]
> >> update_lru_on_read = false
> >>
> >> The changes in 2.3.1 offer additional benefits for couch_server message
> >> throughput but you’ll need to upgrade to get them.
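> >>
> >> In the meantime, if you want to keep an eye on that particular mailbox, a
> >> minimal sketch from a remsh (assuming the process is registered locally as
> >> couch_server, as in stock 2.x):
> >>
> >> process_info(whereis(couch_server), message_queue_len).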
> >>
> >> Cheers, Adam
> >>
> >> P.S. I don’t know what’s going on with the negative memory.other value
> >> there; it’s not intentionally meaningful :)
> >>
> >>
> >> > On Jun 18, 2019, at 11:30 AM, Jérôme Augé <[email protected]>
> >> wrote:
> >> >
> >> > "beam.smp" just got killed by OOM, but I was not in front of the
> >> machine to
> >> > perform this command...
> >> >
> >> > However, here is the CouchDB log of "/_node/_local/_system" for the 30
> >> > minutes preceding the OOM:
> >> > -
> >> >
> >>
> https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log
> >> >
> >> > I guess the spike that triggers the OOM is so quick (< 1 min) that it
> >> > does not get logged (I log every minute).
> >> >
> >> > Is there anything that can be used/deduced from the last line logged
> at
> >> > 2019-06-18T16:00:14+0200?
> >> >
> >> > At 15:55:25, the "run_queue" is at 36: what does it mean? The number of
> >> > active concurrent requests?
> >> >
> >> > From 15:56 to 16:00 the "memory"."other" value is negative: does it
> >> > mean something special, or is it just an integer overflow?
> >> >
> >> >
> >> >
> >> > On Mon, Jun 17, 2019 at 14:09, Vladimir Ralev <[email protected]> wrote:
> >> >
> >> >> Alright, I think the issue will be more visible towards the OOM
> point,
> >> >> however for now since you have the system live with a leak, it will
> be
> >> >> useful to repeat the same steps, but replace
> >> >> "message_queue_len" with "total_heap_size" then with "heap_size" then
> >> with
> >> >> "stack_size" and then with "reductions".
> >> >>
> >> >> For example:
> >> >>
> >> >> MQSizes2 = lists:map(fun(A) -> {_,B} = case
> >> process_info(A,total_heap_size)
> >> >> of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end,
> >> {B,A}
> >> >> end, processes()).
> >> >>
> >> >> Then same with the other params.
> >> >>
> >> >> That can shed some light, otherwise someone will need to monitor
> >> process
> >> >> count and go into them by age and memory patterns.
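> >> >>
> >> >> A minimal sketch of that kind of check (overall process count, plus the
> >> >> ten processes currently holding the most memory, in bytes):
> >> >>
> >> >> erlang:system_info(process_count).
> >> >> lists:sublist(lists:reverse(lists:sort(
> >> >>     [{M, P} || P <- processes(),
> >> >>                {memory, M} <- [process_info(P, memory)]])), 10).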
> >> >>
> >> >> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <[email protected]
> >
> >> >> wrote:
> >> >>
> >> >>> The 2G consumption is from Adrien's system.
> >> >>>
> >> >>> On mine, since I set up the logging of the "/_node/_local/_system"
> >> >>> output:
> >> >>> - on June 14th max memory.processes was 2.6 GB
> >> >>> - on June 15th max memory.processes was 4.7 GB
> >> >>> - on June 16th max memory.processes was 7.0 GB
> >> >>> - today (June 17th) max memory.processes was 8.0 GB (and with an
> >> >>> interactive top I see spikes at 12 GB)
> >> >>>
> >> >>> The memory.processes seems to be steadily increasing over the days,
> >> and
> >> >> I'm
> >> >>> soon expecting the out-of-memory condition to be triggered in a
> >> couple of
> >> >>> days.
> >> >>>
> >> >>> On Mon, Jun 17, 2019 at 11:53, Vladimir Ralev <[email protected]> wrote:
> >> >>>
> >> >>>> Nothing to see here, the message queue stat from Adam's advice is
> >> >>> accurate.
> >> >>>> Note that you should run this only when there is already an
> >> >>>> unreasonable amount of memory leaked/consumed.
> >> >>>>
> >> >>>> But now I realise you had "processes":1877591424 before restart
> from
> >> >> the
> >> >>>> stats above which is less than 2G. Are you using only 2 gigs of
> RAM?
> >> I
> >> >>> got
> >> >>>> confused by the initial comment and I thought you had 15GB RAM. If
> >> you
> >> >>> are
> >> >>>> only using 2 gigs of RAM, it's probably not enough for your
> workload.
> >> >>>>
> >> >>>> On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé <
> >> [email protected]>
> >> >>>> wrote:
> >> >>>>
> >> >>>>> That command seems to work, and here is the output:
> >> >>>>>
> >> >>>>> --8<--
> >> >>>>> # /opt/couchdb/bin/remsh < debug.2.remsh
> >> >>>>> Eshell V7.3  (abort with ^G)
> >> >>>>> ([email protected])1> [{0,<0.0.0>},
> >> >>>>> {0,<0.3.0>},
> >> >>>>> {0,<0.6.0>},
> >> >>>>> {0,<0.7.0>},
> >> >>>>> {0,<0.9.0>},
> >> >>>>> {0,<0.10.0>},
> >> >>>>> {0,<0.11.0>},
> >> >>>>> {0,<0.12.0>},
> >> >>>>> {0,<0.14.0>},
> >> >>>>> {0,<0.15.0>},
> >> >>>>> {0,<0.16.0>},
> >> >>>>> {0,<0.17.0>},
> >> >>>>> {0,<0.18.0>},
> >> >>>>> {0,<0.19.0>},
> >> >>>>> {0,<0.20.0>},
> >> >>>>> {0,<0.21.0>},
> >> >>>>> {0,<0.22.0>},
> >> >>>>> {0,<0.23.0>},
> >> >>>>> {0,<0.24.0>},
> >> >>>>> {0,<0.25.0>},
> >> >>>>> {0,<0.26.0>},
> >> >>>>> {0,<0.27.0>},
> >> >>>>> {0,<0.28.0>},
> >> >>>>> {0,<0.29.0>},
> >> >>>>> {0,<0.31.0>},
> >> >>>>> {0,<0.32.0>},
> >> >>>>> {0,<0.33.0>},
> >> >>>>> {0,...},
> >> >>>>> {...}]
> >> >>>>> ([email protected])2> {0,<0.38.0>}
> >> >>>>> ([email protected])3>
> [{current_function,{erl_eval,do_apply,6}},
> >> >>>>> {initial_call,{erlang,apply,2}},
> >> >>>>> {status,running},
> >> >>>>> {message_queue_len,0},
> >> >>>>> {messages,[]},
> >> >>>>> {links,[<0.32.0>]},
> >> >>>>> {dictionary,[]},
> >> >>>>> {trap_exit,false},
> >> >>>>> {error_handler,error_handler},
> >> >>>>> {priority,normal},
> >> >>>>> {group_leader,<0.31.0>},
> >> >>>>> {total_heap_size,5172},
> >> >>>>> {heap_size,2586},
> >> >>>>> {stack_size,24},
> >> >>>>> {reductions,24496},
> >> >>>>> {garbage_collection,[{min_bin_vheap_size,46422},
> >> >>>>>                      {min_heap_size,233},
> >> >>>>>                      {fullsweep_after,65535},
> >> >>>>>                      {minor_gcs,1}]},
> >> >>>>> {suspending,[]}]
> >> >>>>> ([email protected])4> *** Terminating erlang ('
> >> >> [email protected]
> >> >>> ')
> >> >>>>> -->8--
> >> >>>>>
> >> >>>>> What should I be looking for in this output?
> >> >>>>>
> >> >>>>> On Fri, Jun 14, 2019 at 17:30, Vladimir Ralev <[email protected]> wrote:
> >> >>>>>
> >> >>>>>> That means your couch is creating and destroying processes too
> >> >>>> rapidly. I
> >> >>>>>> haven't seen this, however I think Adam's message_queues stat
> above
> >> >>>> does
> >> >>>>>> the same thing. I didn't notice you can get it from there.
> >> >>>>>>
> >> >>>>>> Either way it will be useful if you can get the shell to work:
> >> >>>>>> Try this command instead for the first, the rest will be the
> same:
> >> >>>>>>
> >> >>>>>> MQSizes2 = lists:map(fun(A) -> {_,B} = case
> >> >>>>>> process_info(A,message_queue_len) of {XV,XB} -> {XV, XB}; _ERR ->
> >> >>>>>> io:format("~p",[_ERR]),{ok, 0} end, {B,A} end, processes()).
> >> >>>>>>
> >> >>>>>> On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé <
> >> >> [email protected]
> >> >>>>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>>> I tried the following, but it seems to fail on the first
> command:
> >> >>>>>>>
> >> >>>>>>> --8<--
> >> >>>>>>> # /opt/couchdb/bin/remsh
> >> >>>>>>> Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
> >> >>>>>>> [async-threads:10] [hipe] [kernel-poll:false]
> >> >>>>>>>
> >> >>>>>>> Eshell V7.3  (abort with ^G)
> >> >>>>>>> ([email protected])1> MQSizes2 = lists:map(fun(A) -> {_,B} =
> >> >>>>>>> process_info(A,message_queue_len), {B,A} end, processes()).
> >> >>>>>>> ** exception error: no match of right hand side value undefined
> >> >>>>>>> -->8--
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Fri, Jun 14, 2019 at 16:08, Vladimir Ralev <[email protected]> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hey guys. I bet it's a mailbox leaking memory. I am very
> >> >>> interested
> >> >>>>> in
> >> >>>>>>>> debugging issues like this too.
> >> >>>>>>>>
> >> >>>>>>>> I can suggest to get an erlang shell and run these commands to
> >> >>> see
> >> >>>>> the
> >> >>>>>>> top
> >> >>>>>>>> memory consuming processes
> >> >>>>>>>>
> >> >>> https://www.mail-archive.com/[email protected]/msg29365.html
> >> >>>>>>>>
> >> >>>>>>>> One issue I will be reporting soon is if one of your nodes is
> >> >>> down
> >> >>>>> for
> >> >>>>>>> some
> >> >>>>>>>> amount of time, it seems like all databases independently try
> >> >> and
> >> >>>>> retry
> >> >>>>>>> to
> >> >>>>>>>> query the missing node and fail, resulting in printing a lot of
> >> >>>> logs
> >> >>>>>> for
> >> >>>>>>>> each db which can overwhelm the logger process. If you have a
> >> >> lot
> >> >>>> of
> >> >>>>>> DBs
> >> >>>>>>>> this makes the problem worse, but it doesn't happen right away
> >> >>> for
> >> >>>>> some
> >> >>>>>>>> reason.
> >> >>>>>>>>
> >> >>>>>>>> On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <
> >> >>>>> [email protected]
> >> >>>>>>>
> >> >>>>>>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hi Jérôme and Adam,
> >> >>>>>>>>>
> >> >>>>>>>>> That's funny, because I'm investigating the exact same
> >> >> problem
> >> >>>>> these
> >> >>>>>>>> days.
> >> >>>>>>>>> We have two CouchDB setups:
> >> >>>>>>>>> - a one-node server (q=2 n=1) with 5000 databases
> >> >>>>>>>>> - a 3-node cluster (q=2 n=3) with 50000 databases
> >> >>>>>>>>>
> >> >>>>>>>>> ... and we are experiencing the problem on both setups. We've
> >> >>>> been
> >> >>>>>>> having
> >> >>>>>>>>> this problem for at least 3-4 months.
> >> >>>>>>>>>
> >> >>>>>>>>> We've monitored:
> >> >>>>>>>>>
> >> >>>>>>>>> - The number of open files: it's relatively low (both the
> >> >>>>>>>>>   system's total and the fds opened by beam.smp).
> >> >>>>>>>>>  https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> >> >>>>>>>>>
> >> >>>>>>>>> - The usage of RAM, total used and used by beam.smp
> >> >>>>>>>>>  https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> >> >>>>>>>>>  It continuously grows, with regular spikes, until killing
> >> >>>> CouchDB
> >> >>>>>>> with
> >> >>>>>>>> an
> >> >>>>>>>>> OOM. After restart, the RAM usage is nice and low, and no
> >> >>> spikes.
> >> >>>>>>>>>
> >> >>>>>>>>> - /_node/_local/_system metrics, before and after restart.
> >> >>> Values
> >> >>>>>> that
> >> >>>>>>>>> significantly differ (before / after restart) are listed
> >> >> here:
> >> >>>>>>>>>  - uptime (obviously ;-))
> >> >>>>>>>>>  - memory.processes : + 3732 %
> >> >>>>>>>>>  - memory.processes_used : + 3735 %
> >> >>>>>>>>>  - memory.binary : + 17700 %
> >> >>>>>>>>>  - context_switches : + 17376 %
> >> >>>>>>>>>  - reductions : + 867832 %
> >> >>>>>>>>>  - garbage_collection_count : + 448248 %
> >> >>>>>>>>>  - words_reclaimed : + 112755 %
> >> >>>>>>>>>  - io_input : + 44226 %
> >> >>>>>>>>>  - io_output : + 157951 %
> >> >>>>>>>>>
> >> >>>>>>>>> Before CouchDB restart:
> >> >>>>>>>>> {
> >> >>>>>>>>>  "uptime":2712973,
> >> >>>>>>>>>  "memory":{
> >> >>>>>>>>>    "other":7250289,
> >> >>>>>>>>>    "atom":512625,
> >> >>>>>>>>>    "atom_used":510002,
> >> >>>>>>>>>    "processes":1877591424,
> >> >>>>>>>>>    "processes_used":1877504920,
> >> >>>>>>>>>    "binary":177468848,
> >> >>>>>>>>>    "code":9653286,
> >> >>>>>>>>>    "ets":16012736
> >> >>>>>>>>>  },
> >> >>>>>>>>>  "run_queue":0,
> >> >>>>>>>>>  "ets_table_count":102,
> >> >>>>>>>>>  "context_switches":1621495509,
> >> >>>>>>>>>  "reductions":968705947589,
> >> >>>>>>>>>  "garbage_collection_count":331826928,
> >> >>>>>>>>>  "words_reclaimed":269964293572,
> >> >>>>>>>>>  "io_input":8812455,
> >> >>>>>>>>>  "io_output":20733066,
> >> >>>>>>>>>  ...
> >> >>>>>>>>>
> >> >>>>>>>>> After CouchDB restart:
> >> >>>>>>>>> {
> >> >>>>>>>>>  "uptime":206,
> >> >>>>>>>>>  "memory":{
> >> >>>>>>>>>    "other":6907493,
> >> >>>>>>>>>    "atom":512625,
> >> >>>>>>>>>    "atom_used":497769,
> >> >>>>>>>>>    "processes":49001944,
> >> >>>>>>>>>    "processes_used":48963168,
> >> >>>>>>>>>    "binary":997032,
> >> >>>>>>>>>    "code":9233842,
> >> >>>>>>>>>    "ets":4779576
> >> >>>>>>>>>  },
> >> >>>>>>>>>  "run_queue":0,
> >> >>>>>>>>>  "ets_table_count":102,
> >> >>>>>>>>>  "context_switches":1015486,
> >> >>>>>>>>>  "reductions":111610788,
> >> >>>>>>>>>  "garbage_collection_count":74011,
> >> >>>>>>>>>  "words_reclaimed":239214127,
> >> >>>>>>>>>  "io_input":19881,
> >> >>>>>>>>>  "io_output":13118,
> >> >>>>>>>>>  ...
> >> >>>>>>>>>
> >> >>>>>>>>> Adrien
> >> >>>>>>>>>
> >> >>>>>>>>> On Fri, Jun 14, 2019 at 15:11, Jérôme Augé <[email protected]> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Ok, so I'll set up a cron job to log (every minute?) the output
> >> >>>>>>>>>> from "/_node/_local/_system" and wait for the next OOM kill.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Any property from "_system" to look for in particular?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Here is a link to the memory usage graph:
> >> >>>>>>>>>> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> >> >>>>>>>>>>
> >> >>>>>>>>>> The memory usage varies, but the general trend is to go up
> >> >>> with
> >> >>>>>> some
> >> >>>>>>>>>> regularity over a week until we reach OOM. When "beam.smp"
> >> >> is
> >> >>>>>> killed,
> >> >>>>>>>>> it's
> >> >>>>>>>>>> reported as consuming 15 GB (as seen in the kernel's OOM
> >> >>> trace
> >> >>>> in
> >> >>>>>>>>> syslog).
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks,
> >> >>>>>>>>>> Jérôme
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski <[email protected]> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi Jérôme,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks for a well-written and detailed report (though the
> >> >>>>> mailing
> >> >>>>>>>> list
> >> >>>>>>>>>>> strips attachments). The _system endpoint provides a lot
> >> >> of
> >> >>>>>> useful
> >> >>>>>>>> data
> >> >>>>>>>>>> for
> >> >>>>>>>>>>> debugging these kinds of situations; do you have a
> >> >> snapshot
> >> >>>> of
> >> >>>>>> the
> >> >>>>>>>>> output
> >> >>>>>>>>>>> when the system was consuming a lot of memory?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Adam
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé <
> >> >>>>>>> [email protected]>
> >> >>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Hi,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> I'm having a hard time figuring out the high memory
> >> >> usage
> >> >>>> of
> >> >>>>> a
> >> >>>>>>>>> CouchDB
> >> >>>>>>>>>>> server.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> What I'm observing is that the memory consumption from the
> >> >>>>>>>>>>>> "beam.smp" process gradually rises until it triggers the
> >> >>>>>>>>>>>> kernel's OOM (Out-Of-Memory) killer, which kills the
> >> >>>>>>>>>>>> "beam.smp" process.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> It also seems that many databases are not compacted: I've made
> >> >>>>>>>>>>>> a script to iterate over the databases and compute the
> >> >>>>>>>>>>>> fragmentation factor, and it seems I have around 2100 databases
> >> >>>>>>>>>>>> with a frag > 70%.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> We have a single CouchDB v2.1.1 server (configured with q=8
> >> >>>>>>>>>>>> n=1) and around 2770 databases.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> The server initially had 4 GB of RAM, and we are now at 16 GB
> >> >>>>>>>>>>>> with 8 vCPUs, and it still regularly reaches OOM. From the
> >> >>>>>>>>>>>> monitoring I see that with 16 GB the OOM is triggered almost
> >> >>>>>>>>>>>> once per week (cf. attached graph).
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> The memory usage seems to increase gradually until it
> >> >>>> reaches
> >> >>>>>>> OOM.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> The Couch server is mostly used by web clients with the
> >> >>>>> PouchDB
> >> >>>>>>> JS
> >> >>>>>>>>> API.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> We have ~1300 distinct users, and by monitoring the netstat/TCP
> >> >>>>>>>>>>>> established connections I guess we have around 100 (maximum)
> >> >>>>>>>>>>>> users at any given time. From what I understand of the
> >> >>>>>>>>>>>> application's logic, each user accesses 2 private databases
> >> >>>>>>>>>>>> (read/write) + 1 common database (read-only).
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> On-disk usage of CouchDB's data directory is around 40
> >> >>> GB.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Any ideas on what could cause such behavior (increasing memory
> >> >>>>>>>>>>>> usage over the course of a week)? Or how to find out what is
> >> >>>>>>>>>>>> happening behind the scenes?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Regards,
> >> >>>>>>>>>>>> Jérôme
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >>
> >>
>
