Alright, I think the issue will be more visible closer to the OOM point.
However, since you have the system live with a leak right now, it would be
useful to repeat the same steps, but replace "message_queue_len" with
"total_heap_size", then with "heap_size", then "stack_size", and finally
"reductions".

For example:

MQSizes2 = lists:map(fun(A) ->
                         {_, B} = case process_info(A, total_heap_size) of
                                      {XV, XB} -> {XV, XB};
                                      _ERR -> io:format("~p", [_ERR]), {ok, 0}
                                  end,
                         {B, A}
                     end, processes()).

Then do the same with the other parameters.
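
If it helps, here is a rough (untested) sketch that runs all of those keys in
one go and keeps only the ten biggest processes for each, pasteable into the
same remsh shell:

%% process_info/2 returns 'undefined' for a process that has already exited,
%% so count those as 0 instead of crashing the match.
Top10 = fun(Key) ->
            Pairs = [case process_info(P, Key) of
                         {Key, V} -> {V, P};
                         undefined -> {0, P}
                     end || P <- processes()],
            lists:sublist(lists:reverse(lists:sort(Pairs)), 10)
        end,
[{Key, Top10(Key)} || Key <- [message_queue_len, total_heap_size,
                              heap_size, stack_size, reductions]].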

That can shed some light; otherwise someone will need to monitor the process
count and dig into individual processes by age and memory patterns.
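
For the process count side, something like this (again just an untested
sketch, adjust the interval as needed) can be left running from the remsh
shell to log it once a minute:

%% Print the local time and the VM process count every 60 seconds.
spawn(fun Watch() ->
          io:format("~p process_count=~p~n",
                    [calendar:local_time(), erlang:system_info(process_count)]),
          timer:sleep(60000),
          Watch()
      end).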

On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <jerome.a...@anakeen.com> wrote:

> The 2G consumption is from Adrien's system.
>
> On mine, since I set up logging of the "/_node/_local/_system" output:
> - on june 14th max memory.processes was 2.6 GB
> - on june 15th max memory.processes was 4.7 GB
> - on june 16th max memory.processes was 7.0 GB
> - today (june 17th) max memory.processes was 8.0 GB (and with an
> interactive top I see spikes at 12 GB)
>
> The memory.processes value seems to be steadily increasing over the days,
> and I expect the out-of-memory condition to be triggered again within a
> couple of days.
>
> On Mon, Jun 17, 2019 at 11:53 AM Vladimir Ralev <vladimir.ra...@gmail.com> wrote:
>
> > Nothing to see here, the message queue stat from Adam's advice is accurate.
> > Note that you should run this only when there is already an unreasonable
> > amount of memory leaked/consumed.
> >
> > But now I realise you had "processes":1877591424 before the restart in the
> > stats above, which is less than 2G. Are you using only 2 gigs of RAM? I got
> > confused by the initial comment and I thought you had 15GB RAM. If you are
> > only using 2 gigs of RAM, it's probably not enough for your workload.
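> >
> > (For reference, the "processes" figure from _system should match what the
> > VM itself reports, so you can cross-check it at any time from a remsh
> > shell with:
> >
> > erlang:memory(processes).
> > erlang:memory().
> >
> > the second call giving the full breakdown, binary and ets included.)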
> >
> > On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé <jerome.a...@anakeen.com>
> > wrote:
> >
> > > That command seems to work, and here is the output:
> > >
> > > --8<--
> > > # /opt/couchdb/bin/remsh < debug.2.remsh
> > > Eshell V7.3  (abort with ^G)
> > > (remsh22574@127.0.0.1)1> [{0,<0.0.0>},
> > >  {0,<0.3.0>},
> > >  {0,<0.6.0>},
> > >  {0,<0.7.0>},
> > >  {0,<0.9.0>},
> > >  {0,<0.10.0>},
> > >  {0,<0.11.0>},
> > >  {0,<0.12.0>},
> > >  {0,<0.14.0>},
> > >  {0,<0.15.0>},
> > >  {0,<0.16.0>},
> > >  {0,<0.17.0>},
> > >  {0,<0.18.0>},
> > >  {0,<0.19.0>},
> > >  {0,<0.20.0>},
> > >  {0,<0.21.0>},
> > >  {0,<0.22.0>},
> > >  {0,<0.23.0>},
> > >  {0,<0.24.0>},
> > >  {0,<0.25.0>},
> > >  {0,<0.26.0>},
> > >  {0,<0.27.0>},
> > >  {0,<0.28.0>},
> > >  {0,<0.29.0>},
> > >  {0,<0.31.0>},
> > >  {0,<0.32.0>},
> > >  {0,<0.33.0>},
> > >  {0,...},
> > >  {...}]
> > > (remsh22574@127.0.0.1)2> {0,<0.38.0>}
> > > (remsh22574@127.0.0.1)3> [{current_function,{erl_eval,do_apply,6}},
> > >  {initial_call,{erlang,apply,2}},
> > >  {status,running},
> > >  {message_queue_len,0},
> > >  {messages,[]},
> > >  {links,[<0.32.0>]},
> > >  {dictionary,[]},
> > >  {trap_exit,false},
> > >  {error_handler,error_handler},
> > >  {priority,normal},
> > >  {group_leader,<0.31.0>},
> > >  {total_heap_size,5172},
> > >  {heap_size,2586},
> > >  {stack_size,24},
> > >  {reductions,24496},
> > >  {garbage_collection,[{min_bin_vheap_size,46422},
> > >                       {min_heap_size,233},
> > >                       {fullsweep_after,65535},
> > >                       {minor_gcs,1}]},
> > >  {suspending,[]}]
> > > (remsh22574@127.0.0.1)4> *** Terminating erlang ('remsh22574@127.0.0.1')
> > > -->8--
> > >
> > > What should I be looking for in this output?
> > >
> > > On Fri, Jun 14, 2019 at 5:30 PM Vladimir Ralev <vladimir.ra...@gmail.com> wrote:
> > >
> > > > That means your couch is creating and destroying processes too rapidly.
> > > > I haven't seen this, however I think Adam's message_queues stat above
> > > > does the same thing. I didn't notice you can get it from there.
> > > >
> > > > Either way it will be useful if you can get the shell to work.
> > > > Try this command instead for the first step; the rest will be the same:
> > > >
> > > > MQSizes2 = lists:map(fun(A) ->
> > > >                          {_, B} = case process_info(A, message_queue_len) of
> > > >                                       {XV, XB} -> {XV, XB};
> > > >                                       _ERR -> io:format("~p", [_ERR]), {ok, 0}
> > > >                                   end,
> > > >                          {B, A}
> > > >                      end, processes()).
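> > > >
> > > > The case clause is there because process_info/2 returns 'undefined' for
> > > > a process that exits between the processes() call and the lookup, which
> > > > is what made the unguarded version fail with the badmatch. As a rough
> > > > sketch of the follow-up steps (assuming the MQSizes2 binding above):
> > > >
> > > > %% Sort descending and keep the 10 biggest mailboxes,
> > > > Top = lists:sublist(lists:reverse(lists:sort(MQSizes2)), 10).
> > > > %% then inspect the worst offender:
> > > > process_info(element(2, hd(Top))).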
> > > >
> > > > On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé <jerome.a...@anakeen.com> wrote:
> > > >
> > > > > I tried the following, but it seems to fail on the first command:
> > > > >
> > > > > --8<--
> > > > > # /opt/couchdb/bin/remsh
> > > > > Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
> > > > > [async-threads:10] [hipe] [kernel-poll:false]
> > > > >
> > > > > Eshell V7.3  (abort with ^G)
> > > > > (couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} =
> > > > > process_info(A,message_queue_len), {B,A} end, processes()).
> > > > > ** exception error: no match of right hand side value undefined
> > > > > -->8--
> > > > >
> > > > >
> > > > > On Fri, Jun 14, 2019 at 4:08 PM Vladimir Ralev <vladimir.ra...@gmail.com> wrote:
> > > > >
> > > > > > Hey guys. I bet it's a mailbox leaking memory. I am very interested in
> > > > > > debugging issues like this too.
> > > > > >
> > > > > > I can suggest getting an erlang shell and running these commands to see
> > > > > > the top memory consuming processes:
> > > > > > https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html
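> > > > > >
> > > > > > Something along these lines (a from-memory sketch, not a verbatim copy
> > > > > > of the commands in that post) lists the ten processes using the most
> > > > > > memory:
> > > > > >
> > > > > > lists:sublist(
> > > > > >     lists:reverse(lists:sort(
> > > > > >         [case process_info(P, memory) of
> > > > > >              {memory, M} -> {M, P};
> > > > > >              undefined -> {0, P}
> > > > > >          end || P <- processes()])), 10).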
> > > > > >
> > > > > > One issue I will be reporting soon is if one of your nodes is down for
> > > > > > some amount of time, it seems like all databases independently try and
> > > > > > retry to query the missing node and fail, resulting in printing a lot of
> > > > > > logs for each db which can overwhelm the logger process. If you have a
> > > > > > lot of DBs this makes the problem worse, but it doesn't happen right away
> > > > > > for some reason.
> > > > > >
> > > > > > On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <adrien.ve...@tolteck.com> wrote:
> > > > > >
> > > > > > > Hi Jérôme and Adam,
> > > > > > >
> > > > > > > That's funny, because I'm investigating the exact same problem these
> > > > > > > days. We have two CouchDB setups:
> > > > > > > - a one-node server (q=2 n=1) with 5000 databases
> > > > > > > - a 3-node cluster (q=2 n=3) with 50000 databases
> > > > > > >
> > > > > > > ... and we are experiencing the problem on both setups. We've been
> > > > > > > having this problem for at least 3-4 months.
> > > > > > >
> > > > > > > We've monitored:
> > > > > > >
> > > > > > > - The number of open files: it's relatively low (both the system's
> > > > > > >   total and the fds opened by beam.smp).
> > > > > > >   https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> > > > > > >
> > > > > > > - The usage of RAM, total used and used by beam.smp
> > > > > > >   https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> > > > > > >   It continuously grows, with regular spikes, until killing CouchDB
> > > > > > >   with an OOM. After restart, the RAM usage is nice and low, and no
> > > > > > >   spikes.
> > > > > > >
> > > > > > > - /_node/_local/_system metrics, before and after restart. Values that
> > > > > > >   significantly differ (before / after restart) are listed here:
> > > > > > >   - uptime (obviously ;-))
> > > > > > >   - memory.processes : + 3732 %
> > > > > > >   - memory.processes_used : + 3735 %
> > > > > > >   - memory.binary : + 17700 %
> > > > > > >   - context_switches : + 17376 %
> > > > > > >   - reductions : + 867832 %
> > > > > > >   - garbage_collection_count : + 448248 %
> > > > > > >   - words_reclaimed : + 112755 %
> > > > > > >   - io_input : + 44226 %
> > > > > > >   - io_output : + 157951 %
> > > > > > >
> > > > > > > Before CouchDB restart:
> > > > > > > {
> > > > > > >   "uptime":2712973,
> > > > > > >   "memory":{
> > > > > > >     "other":7250289,
> > > > > > >     "atom":512625,
> > > > > > >     "atom_used":510002,
> > > > > > >     "processes":1877591424,
> > > > > > >     "processes_used":1877504920,
> > > > > > >     "binary":177468848,
> > > > > > >     "code":9653286,
> > > > > > >     "ets":16012736
> > > > > > >   },
> > > > > > >   "run_queue":0,
> > > > > > >   "ets_table_count":102,
> > > > > > >   "context_switches":1621495509,
> > > > > > >   "reductions":968705947589,
> > > > > > >   "garbage_collection_count":331826928,
> > > > > > >   "words_reclaimed":269964293572,
> > > > > > >   "io_input":8812455,
> > > > > > >   "io_output":20733066,
> > > > > > >   ...
> > > > > > >
> > > > > > > After CouchDB restart:
> > > > > > > {
> > > > > > >   "uptime":206,
> > > > > > >   "memory":{
> > > > > > >     "other":6907493,
> > > > > > >     "atom":512625,
> > > > > > >     "atom_used":497769,
> > > > > > >     "processes":49001944,
> > > > > > >     "processes_used":48963168,
> > > > > > >     "binary":997032,
> > > > > > >     "code":9233842,
> > > > > > >     "ets":4779576
> > > > > > >   },
> > > > > > >   "run_queue":0,
> > > > > > >   "ets_table_count":102,
> > > > > > >   "context_switches":1015486,
> > > > > > >   "reductions":111610788,
> > > > > > >   "garbage_collection_count":74011,
> > > > > > >   "words_reclaimed":239214127,
> > > > > > >   "io_input":19881,
> > > > > > >   "io_output":13118,
> > > > > > >   ...
> > > > > > >
> > > > > > > Adrien
> > > > > > >
> > > > > > > On Fri, Jun 14, 2019 at 3:11 PM Jérôme Augé <jerome.a...@anakeen.com> wrote:
> > > > > > >
> > > > > > > > Ok, so I'll set up a cron job to log (every minute?) the output from
> > > > > > > > "/_node/_local/_system" and wait for the next OOM kill.
> > > > > > > >
> > > > > > > > Any property from "_system" to look for in particular?
> > > > > > > >
> > > > > > > > Here is a link to the memory usage graph:
> > > > > > > > https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> > > > > > > >
> > > > > > > > The memory usage varies, but the general trend is to go up with some
> > > > > > > > regularity over a week until we reach OOM. When "beam.smp" is killed,
> > > > > > > > it's reported as consuming 15 GB (as seen in the kernel's OOM trace in
> > > > > > > > syslog).
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jérôme
> > > > > > > >
> > > > > > > > On Fri, Jun 14, 2019 at 1:48 PM Adam Kocoloski <kocol...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Jérôme,
> > > > > > > > >
> > > > > > > > > Thanks for a well-written and detailed report (though the mailing
> > > > > > > > > list strips attachments). The _system endpoint provides a lot of
> > > > > > > > > useful data for debugging these kinds of situations; do you have a
> > > > > > > > > snapshot of the output when the system was consuming a lot of memory?
> > > > > > > > >
> > > > > > > > > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> > > > > > > > >
> > > > > > > > > Adam
> > > > > > > > >
> > > > > > > > > > On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I'm having a hard time figuring out the high memory usage of a
> > > > > > > > > > CouchDB server.
> > > > > > > > > >
> > > > > > > > > > What I'm observing is that the memory consumption of the "beam.smp"
> > > > > > > > > > process gradually rises until it triggers the kernel's OOM
> > > > > > > > > > (Out-Of-Memory) killer, which kills the "beam.smp" process.
> > > > > > > > > >
> > > > > > > > > > It also seems that many databases are not compacted: I've made a
> > > > > > > > > > script to iterate over the databases and compute the fragmentation
> > > > > > > > > > factor, and it seems I have around 2100 databases with a frag > 70%.
> > > > > > > > > >
> > > > > > > > > > We have a single CouchDB v2.1.1 server (configured with q=8 n=1)
> > > > > > > > > > and around 2770 databases.
> > > > > > > > > >
> > > > > > > > > > The server initially had 4 GB of RAM, and we are now at 16 GB with
> > > > > > > > > > 8 vCPUs, and it still regularly reaches OOM. From the monitoring I
> > > > > > > > > > see that with 16 GB the OOM is triggered almost once per week (c.f.
> > > > > > > > > > attached graph).
> > > > > > > > > >
> > > > > > > > > > The memory usage seems to increase gradually until it reaches OOM.
> > > > > > > > > >
> > > > > > > > > > The Couch server is mostly used by web clients with the PouchDB JS
> > > > > > > > > > API.
> > > > > > > > > >
> > > > > > > > > > We have ~1300 distinct users, and by monitoring the netstat/TCP
> > > > > > > > > > established connections I guess we have around 100 users (maximum)
> > > > > > > > > > at any given time. From what I understand of the application's
> > > > > > > > > > logic, each user accesses 2 private databases (read/write) + 1
> > > > > > > > > > common database (read-only).
> > > > > > > > > >
> > > > > > > > > > On-disk usage of CouchDB's data directory is around 40 GB.
> > > > > > > > > >
> > > > > > > > > > Any ideas on what could cause such behavior (memory usage increasing
> > > > > > > > > > over the course of a week)? Or how to find out what is happening
> > > > > > > > > > behind the scenes?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Jérôme
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
