"beam.smp" just got killed by OOM, but I was not in front of the machine to perform this command...
However, here is the CouchDB log of "/_node/_local/_system" for the 30
minutes preceding the OOM:
- https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log

I guess the spike that triggers the OOM is so quick (< 1 min) that it
does not get logged (I log every minute).

Is there anything that can be used/deduced from the last line logged at
2019-06-18T16:00:14+0200?

At 15:55:25, the "run_queue" is at 36: what does it mean? The number of
active concurrent requests?

From 15:56 to 16:00 the "memory"."other" value is negative: does it mean
something special, or is it just an integer overflow?

On Mon, Jun 17, 2019 at 14:09, Vladimir Ralev <vladimir.ra...@gmail.com>
wrote:

> Alright, I think the issue will be more visible towards the OOM point.
> However, for now, since you have the system live with a leak, it will
> be useful to repeat the same steps, but replace "message_queue_len"
> with "total_heap_size", then with "heap_size", then with "stack_size",
> and then with "reductions".
>
> For example:
>
> MQSizes2 = lists:map(fun(A) -> {_,B} = case process_info(A,total_heap_size)
> of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end, {B,A}
> end, processes()).
>
> Then the same with the other params.
>
> That can shed some light; otherwise someone will need to monitor
> process count and go into them by age and memory patterns.
>
> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <jerome.a...@anakeen.com>
> wrote:
>
> > The 2G consumption is from Adrien's system.
> >
> > On mine, since I set up the logging of the "/_node/_local/_system"
> > output:
> > - on June 14th, max memory.processes was 2.6 GB
> > - on June 15th, max memory.processes was 4.7 GB
> > - on June 16th, max memory.processes was 7.0 GB
> > - today (June 17th), max memory.processes was 8.0 GB (and with an
> >   interactive top I see spikes at 12 GB)
> >
> > The memory.processes seems to be steadily increasing over the days,
> > and I expect the out-of-memory condition to be triggered within a
> > couple of days.
> >
> > On Mon, Jun 17, 2019 at 11:53, Vladimir Ralev
> > <vladimir.ra...@gmail.com> wrote:
> >
> > > Nothing to see here; the message queue stat from Adam's advice is
> > > accurate. Note that you should run this only when there is already
> > > an unreasonable amount of memory leaked/consumed.
> > >
> > > But now I realise you had "processes":1877591424 before restart in
> > > the stats above, which is less than 2 GB. Are you using only 2 gigs
> > > of RAM? I got confused by the initial comment and thought you had
> > > 15 GB of RAM. If you are only using 2 gigs of RAM, it's probably
> > > not enough for your workload.
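For reference, once one of those MQSizes2 lists has been built, the
heaviest processes can be pulled out by sorting it; a minimal sketch in
the same remsh session (assuming MQSizes2 is bound as above; note that
the heap and stack counters are reported in words, not bytes):

--8<--
%% Sort descending on the measured value and keep the ten largest.
Top10 = lists:sublist(lists:reverse(lists:sort(MQSizes2)), 10).

%% Inspect the worst offender. Multiply word-based sizes by
%% erlang:system_info(wordsize) (8 on a 64-bit VM) to get bytes.
{_, Pid} = hd(Top10).
process_info(Pid, [registered_name, current_function, initial_call,
                   message_queue_len, total_heap_size, reductions]).
-->8--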
> > > On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé
> > > <jerome.a...@anakeen.com> wrote:
> > >
> > > > That command seems to work, and here is the output:
> > > >
> > > > --8<--
> > > > # /opt/couchdb/bin/remsh < debug.2.remsh
> > > > Eshell V7.3 (abort with ^G)
> > > > (remsh22574@127.0.0.1)1> [{0,<0.0.0>},
> > > >  {0,<0.3.0>},
> > > >  {0,<0.6.0>},
> > > >  {0,<0.7.0>},
> > > >  {0,<0.9.0>},
> > > >  {0,<0.10.0>},
> > > >  {0,<0.11.0>},
> > > >  {0,<0.12.0>},
> > > >  {0,<0.14.0>},
> > > >  {0,<0.15.0>},
> > > >  {0,<0.16.0>},
> > > >  {0,<0.17.0>},
> > > >  {0,<0.18.0>},
> > > >  {0,<0.19.0>},
> > > >  {0,<0.20.0>},
> > > >  {0,<0.21.0>},
> > > >  {0,<0.22.0>},
> > > >  {0,<0.23.0>},
> > > >  {0,<0.24.0>},
> > > >  {0,<0.25.0>},
> > > >  {0,<0.26.0>},
> > > >  {0,<0.27.0>},
> > > >  {0,<0.28.0>},
> > > >  {0,<0.29.0>},
> > > >  {0,<0.31.0>},
> > > >  {0,<0.32.0>},
> > > >  {0,<0.33.0>},
> > > >  {0,...},
> > > >  {...}]
> > > > (remsh22574@127.0.0.1)2> {0,<0.38.0>}
> > > > (remsh22574@127.0.0.1)3> [{current_function,{erl_eval,do_apply,6}},
> > > >  {initial_call,{erlang,apply,2}},
> > > >  {status,running},
> > > >  {message_queue_len,0},
> > > >  {messages,[]},
> > > >  {links,[<0.32.0>]},
> > > >  {dictionary,[]},
> > > >  {trap_exit,false},
> > > >  {error_handler,error_handler},
> > > >  {priority,normal},
> > > >  {group_leader,<0.31.0>},
> > > >  {total_heap_size,5172},
> > > >  {heap_size,2586},
> > > >  {stack_size,24},
> > > >  {reductions,24496},
> > > >  {garbage_collection,[{min_bin_vheap_size,46422},
> > > >                       {min_heap_size,233},
> > > >                       {fullsweep_after,65535},
> > > >                       {minor_gcs,1}]},
> > > >  {suspending,[]}]
> > > > (remsh22574@127.0.0.1)4> *** Terminating erlang ('remsh22574@127.0.0.1')
> > > > -->8--
> > > >
> > > > What should I be looking for in this output?
> > > >
> > > > On Fri, Jun 14, 2019 at 17:30, Vladimir Ralev
> > > > <vladimir.ra...@gmail.com> wrote:
> > > >
> > > > > That means your couch is creating and destroying processes too
> > > > > rapidly. I haven't seen this; however, I think Adam's
> > > > > message_queues stat above does the same thing. I didn't notice
> > > > > you can get it from there.
> > > > >
> > > > > Either way, it will be useful if you can get the shell to work.
> > > > > Try this command instead for the first; the rest will be the
> > > > > same:
> > > > >
> > > > > MQSizes2 = lists:map(fun(A) -> {_,B} = case
> > > > > process_info(A,message_queue_len) of {XV,XB} -> {XV, XB}; _ERR ->
> > > > > io:format("~p",[_ERR]),{ok, 0} end, {B,A} end, processes()).
> > > > >
> > > > > On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé
> > > > > <jerome.a...@anakeen.com> wrote:
> > > > >
> > > > > > I tried the following, but it seems to fail on the first
> > > > > > command:
> > > > > >
> > > > > > --8<--
> > > > > > # /opt/couchdb/bin/remsh
> > > > > > Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
> > > > > > [async-threads:10] [hipe] [kernel-poll:false]
> > > > > >
> > > > > > Eshell V7.3 (abort with ^G)
> > > > > > (couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} =
> > > > > > process_info(A,message_queue_len), {B,A} end, processes()).
> > > > > > ** exception error: no match of right hand side value undefined
> > > > > > -->8--
> > > > > >
> > > > > > On Fri, Jun 14, 2019 at 16:08, Vladimir Ralev
> > > > > > <vladimir.ra...@gmail.com> wrote:
> > > > > >
> > > > > > > Hey guys. I bet it's a mailbox leaking memory. I am very
> > > > > > > interested in debugging issues like this too.
> > > > > > > I can suggest getting an Erlang shell and running these
> > > > > > > commands to see the top memory-consuming processes:
> > > > > > > https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html
> > > > > > >
> > > > > > > One issue I will be reporting soon: if one of your nodes is
> > > > > > > down for some amount of time, it seems like all databases
> > > > > > > independently try and retry to query the missing node and
> > > > > > > fail, resulting in a lot of logs printed for each db, which
> > > > > > > can overwhelm the logger process. If you have a lot of DBs
> > > > > > > this makes the problem worse, but for some reason it doesn't
> > > > > > > happen right away.
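For what it's worth, a compact way to get such a ranking in one go; a
sketch (not necessarily the exact commands behind the link above), in
which pids that exit mid-scan are silently skipped:

--8<--
%% Rank all processes by memory footprint, largest first, keep the top
%% ten. process_info/2 returns 'undefined' for an already-dead pid,
%% which the {memory, M} pattern in the generator filters out.
TopMem = lists:sublist(
           lists:reverse(lists:sort(
             [{M, P} || P <- processes(),
                        {memory, M} <- [process_info(P, memory)]])),
           10).
-->8--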
> > > > > > > On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé
> > > > > > > <adrien.ve...@tolteck.com> wrote:
> > > > > > >
> > > > > > > > Hi Jérôme and Adam,
> > > > > > > >
> > > > > > > > That's funny, because I'm investigating the exact same
> > > > > > > > problem these days. We have two CouchDB setups:
> > > > > > > > - a one-node server (q=2 n=1) with 5000 databases
> > > > > > > > - a 3-node cluster (q=2 n=3) with 50000 databases
> > > > > > > >
> > > > > > > > ... and we are experiencing the problem on both setups.
> > > > > > > > We've been having this problem for at least 3-4 months.
> > > > > > > >
> > > > > > > > We've monitored:
> > > > > > > >
> > > > > > > > - The number of open files: it's relatively low (both the
> > > > > > > > system's total and the fds opened by beam.smp).
> > > > > > > > https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> > > > > > > >
> > > > > > > > - The usage of RAM, total used and used by beam.smp:
> > > > > > > > https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> > > > > > > > It continuously grows, with regular spikes, until CouchDB
> > > > > > > > is killed by an OOM. After restart, the RAM usage is nice
> > > > > > > > and low, with no spikes.
> > > > > > > >
> > > > > > > > - /_node/_local/_system metrics, before and after restart.
> > > > > > > > Values that significantly differ (before vs. after
> > > > > > > > restart) are listed here:
> > > > > > > > - uptime (obviously ;-))
> > > > > > > > - memory.processes: +3732 %
> > > > > > > > - memory.processes_used: +3735 %
> > > > > > > > - memory.binary: +17700 %
> > > > > > > > - context_switches: +17376 %
> > > > > > > > - reductions: +867832 %
> > > > > > > > - garbage_collection_count: +448248 %
> > > > > > > > - words_reclaimed: +112755 %
> > > > > > > > - io_input: +44226 %
> > > > > > > > - io_output: +157951 %
> > > > > > > >
> > > > > > > > Before CouchDB restart:
> > > > > > > > {
> > > > > > > >   "uptime":2712973,
> > > > > > > >   "memory":{
> > > > > > > >     "other":7250289,
> > > > > > > >     "atom":512625,
> > > > > > > >     "atom_used":510002,
> > > > > > > >     "processes":1877591424,
> > > > > > > >     "processes_used":1877504920,
> > > > > > > >     "binary":177468848,
> > > > > > > >     "code":9653286,
> > > > > > > >     "ets":16012736
> > > > > > > >   },
> > > > > > > >   "run_queue":0,
> > > > > > > >   "ets_table_count":102,
> > > > > > > >   "context_switches":1621495509,
> > > > > > > >   "reductions":968705947589,
> > > > > > > >   "garbage_collection_count":331826928,
> > > > > > > >   "words_reclaimed":269964293572,
> > > > > > > >   "io_input":8812455,
> > > > > > > >   "io_output":20733066,
> > > > > > > >   ...
> > > > > > > >
> > > > > > > > After CouchDB restart:
> > > > > > > > {
> > > > > > > >   "uptime":206,
> > > > > > > >   "memory":{
> > > > > > > >     "other":6907493,
> > > > > > > >     "atom":512625,
> > > > > > > >     "atom_used":497769,
> > > > > > > >     "processes":49001944,
> > > > > > > >     "processes_used":48963168,
> > > > > > > >     "binary":997032,
> > > > > > > >     "code":9233842,
> > > > > > > >     "ets":4779576
> > > > > > > >   },
> > > > > > > >   "run_queue":0,
> > > > > > > >   "ets_table_count":102,
> > > > > > > >   "context_switches":1015486,
> > > > > > > >   "reductions":111610788,
> > > > > > > >   "garbage_collection_count":74011,
> > > > > > > >   "words_reclaimed":239214127,
> > > > > > > >   "io_input":19881,
> > > > > > > >   "io_output":13118,
> > > > > > > >   ...
> > > > > > > >
> > > > > > > > Adrien
> > > > > > > >
> > > > > > > > On Fri, Jun 14, 2019 at 15:11, Jérôme Augé
> > > > > > > > <jerome.a...@anakeen.com> wrote:
> > > > > > > >
> > > > > > > > > Ok, so I'll set up a cron job to log the output of
> > > > > > > > > "/_node/_local/_system" (every minute?) and wait for the
> > > > > > > > > next OOM kill.
> > > > > > > > >
> > > > > > > > > Any property from "_system" to look for in particular?
> > > > > > > > >
> > > > > > > > > Here is a link to the memory usage graph:
> > > > > > > > > https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> > > > > > > > >
> > > > > > > > > The memory usage varies, but the general trend is
> > > > > > > > > upward, with some regularity, over a week until we reach
> > > > > > > > > OOM. When "beam.smp" is killed, it's reported as
> > > > > > > > > consuming 15 GB (as seen in the kernel's OOM trace in
> > > > > > > > > syslog).
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jérôme
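In the meantime, the same counters can be spot-checked from a remsh
session between cron samples; a minimal sketch, assuming _system derives
its figures from the standard Erlang introspection BIFs:

--8<--
%% Rough remsh counterparts of the _system fields being logged:
erlang:memory().                       % the "memory" breakdown, in bytes
erlang:statistics(run_queue).          % processes ready to run, awaiting a scheduler
erlang:statistics(context_switches).   % {ContextSwitches, 0}
erlang:statistics(garbage_collection). % {NumGCs, WordsReclaimed, 0}
-->8--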
> > > > > > > > > On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski
> > > > > > > > > <kocol...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jérôme,
> > > > > > > > > >
> > > > > > > > > > Thanks for a well-written and detailed report (though
> > > > > > > > > > the mailing list strips attachments). The _system
> > > > > > > > > > endpoint provides a lot of useful data for debugging
> > > > > > > > > > these kinds of situations; do you have a snapshot of
> > > > > > > > > > the output when the system was consuming a lot of
> > > > > > > > > > memory?
> > > > > > > > > >
> > > > > > > > > > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> > > > > > > > > >
> > > > > > > > > > Adam
> > > > > > > > > >
> > > > > > > > > > > On Jun 14, 2019, at 5:44 AM, Jérôme Augé
> > > > > > > > > > > <jerome.a...@anakeen.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I'm having a hard time figuring out the high memory
> > > > > > > > > > > usage of a CouchDB server.
> > > > > > > > > > >
> > > > > > > > > > > What I'm observing is that the memory consumption of
> > > > > > > > > > > the "beam.smp" process gradually rises until it
> > > > > > > > > > > triggers the kernel's OOM (Out-Of-Memory) killer,
> > > > > > > > > > > which kills the "beam.smp" process.
> > > > > > > > > > >
> > > > > > > > > > > It also seems that many databases are not compacted:
> > > > > > > > > > > I've made a script that iterates over the databases
> > > > > > > > > > > to compute the fragmentation factor, and it seems I
> > > > > > > > > > > have around 2100 databases with a frag > 70%.
> > > > > > > > > > >
> > > > > > > > > > > We have a single CouchDB v2.1.1 server (configured
> > > > > > > > > > > with q=8 n=1) and around 2770 databases.
> > > > > > > > > > >
> > > > > > > > > > > The server initially had 4 GB of RAM, and we are now
> > > > > > > > > > > at 16 GB with 8 vCPUs, and it still regularly
> > > > > > > > > > > reaches OOM. From the monitoring I see that with
> > > > > > > > > > > 16 GB the OOM is triggered almost once per week
> > > > > > > > > > > (cf. attached graph).
> > > > > > > > > > >
> > > > > > > > > > > The memory usage seems to increase gradually until
> > > > > > > > > > > it reaches OOM.
> > > > > > > > > > >
> > > > > > > > > > > The Couch server is mostly used by web clients with
> > > > > > > > > > > the PouchDB JS API.
> > > > > > > > > > >
> > > > > > > > > > > We have ~1300 distinct users, and by monitoring the
> > > > > > > > > > > netstat/TCP established connections I guess we have
> > > > > > > > > > > around 100 users (maximum) at any given time. From
> > > > > > > > > > > what I understand of the application's logic, each
> > > > > > > > > > > user accesses 2 private databases (read/write) + 1
> > > > > > > > > > > common database (read-only).
> > > > > > > > > > >
> > > > > > > > > > > On-disk usage of CouchDB's data directory is around
> > > > > > > > > > > 40 GB.
> > > > > > > > > > > Any ideas on what could cause such behavior (memory
> > > > > > > > > > > usage increasing over the course of a week)? Or how
> > > > > > > > > > > to find out what is happening behind the scenes?
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Jérôme
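For reference, a sketch of the fragmentation arithmetic mentioned in the
original report, assuming it follows the formula used by CouchDB's
compaction daemon on the sizes returned by GET /{db}:

--8<--
%% frag = (file_size - active_size) / file_size, expressed in percent.
Frag = fun(FileSize, ActiveSize) ->
           100 * (FileSize - ActiveSize) / FileSize
       end.
%% e.g. Frag(1000000, 250000) returns 75.0, i.e. ~75% of the file
%% would be reclaimed by compaction.
-->8--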