"beam.smp" just got killed by OOM, but I was not in front of the machine to perform this command...
However, here is the CouchDB log of "/_node/_local/_system" for the 30
minutes preceding the OOM:
- https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log

I guess the spike that triggers the OOM is so quick (< 1 min) that it
does not get logged (I log every minute).

Is there anything that can be used/deduced from the last line logged at
2019-06-18T16:00:14+0200?

At 15:55:25, the "run_queue" is at 36: what does it mean? The number of
active concurrent requests?

From 15:56 to 16:00 the "memory"."other" value is negative: does it mean
something special, or is it just an integer overflow?

On Mon, Jun 17, 2019 at 14:09, Vladimir Ralev <vladimir.ra...@gmail.com>
wrote:

> Alright, I think the issue will be more visible towards the OOM point.
> However, for now, since you have the system live with a leak, it will
> be useful to repeat the same steps, but replace "message_queue_len"
> with "total_heap_size", then with "heap_size", then with "stack_size",
> and then with "reductions".
>
> For example:
>
> MQSizes2 = lists:map(fun(A) -> {_,B} = case process_info(A,total_heap_size)
> of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end, {B,A}
> end, processes()).
>
> Then the same with the other params.
>
> That can shed some light; otherwise someone will need to monitor
> process count and go into them by age and memory patterns.
>
> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <jerome.a...@anakeen.com>
> wrote:
>
> > The 2G consumption is from Adrien's system.
> >
> > On mine, since I set up the logging of the "/_node/_local/_system"
> > output:
> > - on June 14th, max memory.processes was 2.6 GB
> > - on June 15th, max memory.processes was 4.7 GB
> > - on June 16th, max memory.processes was 7.0 GB
> > - today (June 17th), max memory.processes was 8.0 GB (and with an
> >   interactive top I see spikes at 12 GB)
> >
> > The memory.processes seems to be steadily increasing over the days,
> > and I expect the out-of-memory condition to be triggered within a
> > couple of days.
> >
> > On Mon, Jun 17, 2019 at 11:53, Vladimir Ralev
> > <vladimir.ra...@gmail.com> wrote:
> >
> > > Nothing to see here; the message queue stat from Adam's advice is
> > > accurate. Note that you should run this only when there is already
> > > an unreasonable amount of memory leaked/consumed.
> > >
> > > But now I realise you had "processes":1877591424 before restart in
> > > the stats above, which is less than 2 GB. Are you using only 2 gigs
> > > of RAM? I got confused by the initial comment and thought you had
> > > 15 GB of RAM. If you are only using 2 gigs of RAM, it's probably
> > > not enough for your workload.
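For reference, once one of those MQSizes2 lists has been built, the
heaviest processes can be pulled out by sorting it; a minimal sketch in
the same remsh session (assuming MQSizes2 is bound as above; note that
the heap and stack counters are reported in words, not bytes):

--8<--
%% Sort descending on the measured value and keep the ten largest.
Top10 = lists:sublist(lists:reverse(lists:sort(MQSizes2)), 10).

%% Inspect the worst offender. Multiply word-based sizes by
%% erlang:system_info(wordsize) (8 on a 64-bit VM) to get bytes.
{_, Pid} = hd(Top10).
process_info(Pid, [registered_name, current_function, initial_call,
                   message_queue_len, total_heap_size, reductions]).
-->8--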
> > > On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé
> > > <jerome.a...@anakeen.com> wrote:
> > >
> > > > That command seems to work, and here is the output:
> > > >
> > > > --8<--
> > > > # /opt/couchdb/bin/remsh < debug.2.remsh
> > > > Eshell V7.3 (abort with ^G)
> > > > (remsh22574@127.0.0.1)1> [{0,<0.0.0>},
> > > >  {0,<0.3.0>},
> > > >  {0,<0.6.0>},
> > > >  {0,<0.7.0>},
> > > >  {0,<0.9.0>},
> > > >  {0,<0.10.0>},
> > > >  {0,<0.11.0>},
> > > >  {0,<0.12.0>},
> > > >  {0,<0.14.0>},
> > > >  {0,<0.15.0>},
> > > >  {0,<0.16.0>},
> > > >  {0,<0.17.0>},
> > > >  {0,<0.18.0>},
> > > >  {0,<0.19.0>},
> > > >  {0,<0.20.0>},
> > > >  {0,<0.21.0>},
> > > >  {0,<0.22.0>},
> > > >  {0,<0.23.0>},
> > > >  {0,<0.24.0>},
> > > >  {0,<0.25.0>},
> > > >  {0,<0.26.0>},
> > > >  {0,<0.27.0>},
> > > >  {0,<0.28.0>},
> > > >  {0,<0.29.0>},
> > > >  {0,<0.31.0>},
> > > >  {0,<0.32.0>},
> > > >  {0,<0.33.0>},
> > > >  {0,...},
> > > >  {...}]
> > > > (remsh22574@127.0.0.1)2> {0,<0.38.0>}
> > > > (remsh22574@127.0.0.1)3> [{current_function,{erl_eval,do_apply,6}},
> > > >  {initial_call,{erlang,apply,2}},
> > > >  {status,running},
> > > >  {message_queue_len,0},
> > > >  {messages,[]},
> > > >  {links,[<0.32.0>]},
> > > >  {dictionary,[]},
> > > >  {trap_exit,false},
> > > >  {error_handler,error_handler},
> > > >  {priority,normal},
> > > >  {group_leader,<0.31.0>},
> > > >  {total_heap_size,5172},
> > > >  {heap_size,2586},
> > > >  {stack_size,24},
> > > >  {reductions,24496},
> > > >  {garbage_collection,[{min_bin_vheap_size,46422},
> > > >                       {min_heap_size,233},
> > > >                       {fullsweep_after,65535},
> > > >                       {minor_gcs,1}]},
> > > >  {suspending,[]}]
> > > > (remsh22574@127.0.0.1)4> *** Terminating erlang ('remsh22574@127.0.0.1')
> > > > -->8--
> > > >
> > > > What should I be looking for in this output?
> > > >
> > > > On Fri, Jun 14, 2019 at 17:30, Vladimir Ralev
> > > > <vladimir.ra...@gmail.com> wrote:
> > > >
> > > > > That means your couch is creating and destroying processes too
> > > > > rapidly. I haven't seen this; however, I think Adam's
> > > > > message_queues stat above does the same thing. I didn't notice
> > > > > you can get it from there.
> > > > >
> > > > > Either way, it will be useful if you can get the shell to work.
> > > > > Try this command instead for the first; the rest will be the
> > > > > same:
> > > > >
> > > > > MQSizes2 = lists:map(fun(A) -> {_,B} = case
> > > > > process_info(A,message_queue_len) of {XV,XB} -> {XV, XB}; _ERR ->
> > > > > io:format("~p",[_ERR]),{ok, 0} end, {B,A} end, processes()).
> > > > >
> > > > > On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé
> > > > > <jerome.a...@anakeen.com> wrote:
> > > > >
> > > > > > I tried the following, but it seems to fail on the first
> > > > > > command:
> > > > > >
> > > > > > --8<--
> > > > > > # /opt/couchdb/bin/remsh
> > > > > > Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
> > > > > > [async-threads:10] [hipe] [kernel-poll:false]
> > > > > >
> > > > > > Eshell V7.3 (abort with ^G)
> > > > > > (couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} =
> > > > > > process_info(A,message_queue_len), {B,A} end, processes()).
> > > > > > ** exception error: no match of right hand side value undefined
> > > > > > -->8--
> > > > > >
> > > > > > On Fri, Jun 14, 2019 at 16:08, Vladimir Ralev
> > > > > > <vladimir.ra...@gmail.com> wrote:
> > > > > >
> > > > > > > Hey guys. I bet it's a mailbox leaking memory. I am very
> > > > > > > interested in debugging issues like this too.
> > > > > > > I can suggest getting an Erlang shell and running these
> > > > > > > commands to see the top memory-consuming processes:
> > > > > > > https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html
> > > > > > >
> > > > > > > One issue I will be reporting soon: if one of your nodes is
> > > > > > > down for some amount of time, it seems like all databases
> > > > > > > independently try and retry to query the missing node and
> > > > > > > fail, resulting in a lot of logs printed for each db, which
> > > > > > > can overwhelm the logger process. If you have a lot of DBs
> > > > > > > this makes the problem worse, but for some reason it doesn't
> > > > > > > happen right away.
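For what it's worth, a compact way to get such a ranking in one go; a
sketch (not necessarily the exact commands behind the link above), in
which pids that exit mid-scan are silently skipped:

--8<--
%% Rank all processes by memory footprint, largest first, keep the top
%% ten. process_info/2 returns 'undefined' for an already-dead pid,
%% which the {memory, M} pattern in the generator filters out.
TopMem = lists:sublist(
           lists:reverse(lists:sort(
             [{M, P} || P <- processes(),
                        {memory, M} <- [process_info(P, memory)]])),
           10).
-->8--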
> > > > > > > On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé
> > > > > > > <adrien.ve...@tolteck.com> wrote:
> > > > > > >
> > > > > > > > Hi Jérôme and Adam,
> > > > > > > >
> > > > > > > > That's funny, because I'm investigating the exact same
> > > > > > > > problem these days. We have two CouchDB setups:
> > > > > > > > - a one-node server (q=2 n=1) with 5000 databases
> > > > > > > > - a 3-node cluster (q=2 n=3) with 50000 databases
> > > > > > > >
> > > > > > > > ... and we are experiencing the problem on both setups.
> > > > > > > > We've been having this problem for at least 3-4 months.
> > > > > > > >
> > > > > > > > We've monitored:
> > > > > > > >
> > > > > > > > - The number of open files: it's relatively low (both the
> > > > > > > > system's total and the fds opened by beam.smp).
> > > > > > > > https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> > > > > > > >
> > > > > > > > - The usage of RAM, total used and used by beam.smp:
> > > > > > > > https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> > > > > > > > It continuously grows, with regular spikes, until CouchDB
> > > > > > > > is killed by an OOM. After restart, the RAM usage is nice
> > > > > > > > and low, with no spikes.
> > > > > > > >
> > > > > > > > - /_node/_local/_system metrics, before and after restart.
> > > > > > > > Values that significantly differ (before vs. after
> > > > > > > > restart) are listed here:
> > > > > > > > - uptime (obviously ;-))
> > > > > > > > - memory.processes: +3732 %
> > > > > > > > - memory.processes_used: +3735 %
> > > > > > > > - memory.binary: +17700 %
> > > > > > > > - context_switches: +17376 %
> > > > > > > > - reductions: +867832 %
> > > > > > > > - garbage_collection_count: +448248 %
> > > > > > > > - words_reclaimed: +112755 %
> > > > > > > > - io_input: +44226 %
> > > > > > > > - io_output: +157951 %
> > > > > > > >
> > > > > > > > Before CouchDB restart:
> > > > > > > > {
> > > > > > > >   "uptime":2712973,
> > > > > > > >   "memory":{
> > > > > > > >     "other":7250289,
> > > > > > > >     "atom":512625,
> > > > > > > >     "atom_used":510002,
> > > > > > > >     "processes":1877591424,
> > > > > > > >     "processes_used":1877504920,
> > > > > > > >     "binary":177468848,
> > > > > > > >     "code":9653286,
> > > > > > > >     "ets":16012736
> > > > > > > >   },
> > > > > > > >   "run_queue":0,
> > > > > > > >   "ets_table_count":102,
> > > > > > > >   "context_switches":1621495509,
> > > > > > > >   "reductions":968705947589,
> > > > > > > >   "garbage_collection_count":331826928,
> > > > > > > >   "words_reclaimed":269964293572,
> > > > > > > >   "io_input":8812455,
> > > > > > > >   "io_output":20733066,
> > > > > > > >   ...
> > > > > > > >
> > > > > > > > After CouchDB restart:
> > > > > > > > {
> > > > > > > >   "uptime":206,
> > > > > > > >   "memory":{
> > > > > > > >     "other":6907493,
> > > > > > > >     "atom":512625,
> > > > > > > >     "atom_used":497769,
> > > > > > > >     "processes":49001944,
> > > > > > > >     "processes_used":48963168,
> > > > > > > >     "binary":997032,
> > > > > > > >     "code":9233842,
> > > > > > > >     "ets":4779576
> > > > > > > >   },
> > > > > > > >   "run_queue":0,
> > > > > > > >   "ets_table_count":102,
> > > > > > > >   "context_switches":1015486,
> > > > > > > >   "reductions":111610788,
> > > > > > > >   "garbage_collection_count":74011,
> > > > > > > >   "words_reclaimed":239214127,
> > > > > > > >   "io_input":19881,
> > > > > > > >   "io_output":13118,
> > > > > > > >   ...
> > > > > > > >
> > > > > > > > Adrien
> > > > > > > >
> > > > > > > > On Fri, Jun 14, 2019 at 15:11, Jérôme Augé
> > > > > > > > <jerome.a...@anakeen.com> wrote:
> > > > > > > >
> > > > > > > > > Ok, so I'll set up a cron job to log the output of
> > > > > > > > > "/_node/_local/_system" (every minute?) and wait for the
> > > > > > > > > next OOM kill.
> > > > > > > > >
> > > > > > > > > Any property from "_system" to look for in particular?
> > > > > > > > >
> > > > > > > > > Here is a link to the memory usage graph:
> > > > > > > > > https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> > > > > > > > >
> > > > > > > > > The memory usage varies, but the general trend is
> > > > > > > > > upward, with some regularity, over a week until we reach
> > > > > > > > > OOM. When "beam.smp" is killed, it's reported as
> > > > > > > > > consuming 15 GB (as seen in the kernel's OOM trace in
> > > > > > > > > syslog).
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jérôme
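In the meantime, the same counters can be spot-checked from a remsh
session between cron samples; a minimal sketch, assuming _system derives
its figures from the standard Erlang introspection BIFs:

--8<--
%% Rough remsh counterparts of the _system fields being logged:
erlang:memory().                       % the "memory" breakdown, in bytes
erlang:statistics(run_queue).          % processes ready to run, awaiting a scheduler
erlang:statistics(context_switches).   % {ContextSwitches, 0}
erlang:statistics(garbage_collection). % {NumGCs, WordsReclaimed, 0}
-->8--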
> > > > > > > > > On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski
> > > > > > > > > <kocol...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jérôme,
> > > > > > > > > >
> > > > > > > > > > Thanks for a well-written and detailed report (though
> > > > > > > > > > the mailing list strips attachments). The _system
> > > > > > > > > > endpoint provides a lot of useful data for debugging
> > > > > > > > > > these kinds of situations; do you have a snapshot of
> > > > > > > > > > the output when the system was consuming a lot of
> > > > > > > > > > memory?
> > > > > > > > > >
> > > > > > > > > > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> > > > > > > > > >
> > > > > > > > > > Adam
> > > > > > > > > >
> > > > > > > > > > > On Jun 14, 2019, at 5:44 AM, Jérôme Augé
> > > > > > > > > > > <jerome.a...@anakeen.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I'm having a hard time figuring out the high memory
> > > > > > > > > > > usage of a CouchDB server.
> > > > > > > > > > >
> > > > > > > > > > > What I'm observing is that the memory consumption of
> > > > > > > > > > > the "beam.smp" process gradually rises until it
> > > > > > > > > > > triggers the kernel's OOM (Out-Of-Memory) killer,
> > > > > > > > > > > which kills the "beam.smp" process.
> > > > > > > > > > >
> > > > > > > > > > > It also seems that many databases are not compacted:
> > > > > > > > > > > I've made a script that iterates over the databases
> > > > > > > > > > > to compute the fragmentation factor, and it seems I
> > > > > > > > > > > have around 2100 databases with a frag > 70%.
> > > > > > > > > > >
> > > > > > > > > > > We have a single CouchDB v2.1.1 server (configured
> > > > > > > > > > > with q=8 n=1) and around 2770 databases.
> > > > > > > > > > >
> > > > > > > > > > > The server initially had 4 GB of RAM, and we are now
> > > > > > > > > > > at 16 GB with 8 vCPUs, and it still regularly
> > > > > > > > > > > reaches OOM. From the monitoring I see that with
> > > > > > > > > > > 16 GB the OOM is triggered almost once per week
> > > > > > > > > > > (cf. attached graph).
> > > > > > > > > > >
> > > > > > > > > > > The memory usage seems to increase gradually until
> > > > > > > > > > > it reaches OOM.
> > > > > > > > > > >
> > > > > > > > > > > The Couch server is mostly used by web clients with
> > > > > > > > > > > the PouchDB JS API.
> > > > > > > > > > >
> > > > > > > > > > > We have ~1300 distinct users, and by monitoring the
> > > > > > > > > > > netstat/TCP established connections I guess we have
> > > > > > > > > > > around 100 users (maximum) at any given time. From
> > > > > > > > > > > what I understand of the application's logic, each
> > > > > > > > > > > user accesses 2 private databases (read/write) + 1
> > > > > > > > > > > common database (read-only).
> > > > > > > > > > >
> > > > > > > > > > > On-disk usage of CouchDB's data directory is around
> > > > > > > > > > > 40 GB.
> > > > > > > > > > > Any ideas on what could cause such behavior (memory
> > > > > > > > > > > usage increasing over the course of a week)? Or how
> > > > > > > > > > > to find out what is happening behind the scenes?
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Jérôme
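For reference, a sketch of the fragmentation arithmetic mentioned in the
original report, assuming it follows the formula used by CouchDB's
compaction daemon on the sizes returned by GET /{db}:

--8<--
%% frag = (file_size - active_size) / file_size, expressed in percent.
Frag = fun(FileSize, ActiveSize) ->
           100 * (FileSize - ActiveSize) / FileSize
       end.
%% e.g. Frag(1000000, 250000) returns 75.0, i.e. ~75% of the file
%% would be reclaimed by compaction.
-->8--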