It should be in the run directory. The screenshot here (https://framapic.org/jnHAyVEKq98k/kXCQv3pyUdz0.png) says it was written. Maybe double-check.
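For reference: by default the Erlang VM writes erl_crash.dump to the emulator's current working directory, and the ERL_CRASH_DUMP environment variable overrides that location. A minimal check from a remsh, assuming OTP defaults:

    os:getenv("ERL_CRASH_DUMP").  % false means no override; the default name erl_crash.dump is used
    file:get_cwd().               % the directory the dump would be written to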
On Thu, Jun 27, 2019 at 1:27 PM Adrien Vergé <adrien.ve...@tolteck.com> wrote:

Vladimir, I would love to have your debugging skills and feedback on this, thanks for proposing! Unfortunately this only happens on real servers, after weeks of continuous real-life requests. In the past we tried to reproduce it on test servers, but the memory leak doesn't happen if CouchDB is not very active (or it happens, but is unnoticeable because it is too slow). And these servers contain protected data that our rules don't allow us to share.

I also searched for the crash dump (sudo find / -name '*.dump'; sudo find / -name 'erl_crash*') but couldn't find it; do you know where it could be located?

We already have swap on these machines. Next time the system comes close to the OOM point, I will try to see whether they use swap or not.

On Wed, Jun 26, 2019 at 12:43, Vladimir Ralev <vladimir.ra...@gmail.com> wrote:

Ouch. I have an idea: can you add a bunch of swap on one of those machines, say 20 gigs? This should allow the machines to work a little longer in slow mode instead of running out of memory, which will buy you time to run more diagnostics after the incident occurs. This will probably hurt response times a lot, though, and might break your apps.

Also, can you upload the erl_crash.dump file that the crash generated?

PS: I would love to get shell access to a system like that. If you can reproduce the issue on a test machine and give me access, I should be able to come up with something. Free of charge.

On Wed, Jun 26, 2019 at 1:17 PM Adrien Vergé <adrien.ve...@tolteck.com> wrote:

Hi all,

Here is more feedback, since one of our CouchDB servers crashed last night.

- Setup: CouchDB 2.3.1 on a 3-node cluster (q=2 n=3) with ~50k small databases.

- Only one of the 3 nodes crashed. The others should crash in a few days (they usually crash and restart every ~3 weeks).

- update_lru_on_read = false

- The extra memory consumption comes from the beam.smp process (see graph below).

- The crash is an OOM; see the last log lines before restart:

    eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
    Crash dump is being written to: erl_crash.dump...
    [os_mon] memory supervisor port (memsup): Erlang has closed
    [os_mon] cpu supervisor port (cpu_sup): Erlang has closed

- Over the last weeks, beam.smp memory usage kept increasing and increasing. See the graph I made at https://framapic.org/jnHAyVEKq98k/kXCQv3pyUdz0.png

- /_node/_local/_system metrics look normal. The difference between an "about to crash" node and a "freshly restarted, lots of free RAM" node is in uptime, memory.processes_used, memory.binary, context_switches, reductions, garbage_collection_count, io_input... as previously discussed with Adam.

- This command gives exactly the same output on an "about to crash" node as on a "freshly restarted, lots of free RAM" node:

    MQSizes2 = lists:map(fun(A) ->
                             {_, B} = case process_info(A, total_heap_size) of
                                          {XV, XB} -> {XV, XB};
                                          _ERR -> io:format("~p", [_ERR]), {ok, 0}
                                      end,
                             {B, A}
                         end, processes()).
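Once MQSizes2 is built, the interesting part is the top of the sorted list. A small follow-up sketch (total_heap_size is counted in words, so multiply by the word size for bytes):

    Top10 = lists:sublist(lists:reverse(lists:sort(MQSizes2)), 10).
    [{Words * erlang:system_info(wordsize), Pid,
      process_info(Pid, registered_name)} || {Words, Pid} <- Top10].

On a healthy node this shows a handful of modest heaps; on a leaking node one or a few processes should dominate.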
On Thu, Jun 20, 2019 at 17:08, Jérôme Augé <jerome.a...@anakeen.com> wrote:

We are going to plan an upgrade from 2.1.1 to 2.3.1 in the coming weeks.

I have a side question concerning CouchDB upgrades: is the database binary-compatible between v2.1.1 and v2.3.1? In case we ever need to downgrade back to 2.1.1, can the binary data be kept?

Regards,
Jérôme

On Wed, Jun 19, 2019 at 08:59, Jérôme Augé <jerome.a...@anakeen.com> wrote:

Thanks Adam for your explanations!

The "update_lru_on_read" option is already set to false on this instance (I had already seen the comments on those pull requests).

We are indeed running an "old" 2.1.1 version, and we have advised the client that an upgrade might be needed to sort out (or further investigate) these problems.

Thanks again,
Jérôme

On Tue, Jun 18, 2019 at 18:59, Adam Kocoloski <kocol...@apache.org> wrote:

Hi Jérôme, definitely useful.

The "run_queue" is the number of Erlang processes in a runnable state that are not currently executing on a scheduler. When that value is greater than zero it means the node is hitting some compute limitations. Seeing a small positive value from time to time is no problem.

Your last six snapshots show a message queue backlog in couch_server. That could be what caused the node to OOM. The couch_server process is a singleton, and if it accumulates a large message backlog there are limited backpressure or scaling mechanisms to help it recover. I noticed you're running 2.1.1; there were a couple of important enhancements to reduce the message flow through couch_server in more recent releases:

2.2.0: https://github.com/apache/couchdb/pull/1118
2.3.1: https://github.com/apache/couchdb/pull/1593

The change in 2.2.0 is just a change in the default configuration; you can try applying it to your server by setting:

    [couchdb]
    update_lru_on_read = false

The changes in 2.3.1 offer additional benefits for couch_server message throughput, but you'll need to upgrade to get them.

Cheers, Adam

P.S. I don't know what's going on with the negative memory.other value there; it's not intentionally meaningful :)
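Since couch_server is a registered singleton on 2.x, its backlog can be sampled directly from a remsh. A minimal sketch, assuming the registered name couch_server:

    %% {message_queue_len,0} is healthy; a large or growing value is the
    %% backlog described above.
    process_info(whereis(couch_server), message_queue_len).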
On Jun 18, 2019, at 11:30 AM, Jérôme Augé <jerome.a...@anakeen.com> wrote:

"beam.smp" just got killed by the OOM killer, but I was not in front of the machine to perform this command...

However, here is the CouchDB log of "/_node/_local/_system" for the 30 minutes preceding the OOM:
https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log

I guess the spike that triggers the OOM is so quick (< 1 min) that it does not get logged (I log every minute).

Is there anything that can be used/deduced from the last line logged at 2019-06-18T16:00:14+0200?

At 15:55:25 the "run_queue" is at 36: what does it mean? The number of active concurrent requests?

From 15:56 to 16:00 the "memory"."other" value is negative: does it mean something special, or is it just an integer overflow?

On Mon, Jun 17, 2019 at 14:09, Vladimir Ralev <vladimir.ra...@gmail.com> wrote:

Alright, I think the issue will be more visible towards the OOM point. However, for now, since you have the system live with a leak, it will be useful to repeat the same steps but replace "message_queue_len" with "total_heap_size", then with "heap_size", then with "stack_size", and then with "reductions".

For example:

    MQSizes2 = lists:map(fun(A) ->
                             {_, B} = case process_info(A, total_heap_size) of
                                          {XV, XB} -> {XV, XB};
                                          _ERR -> io:format("~p", [_ERR]), {ok, 0}
                                      end,
                             {B, A}
                         end, processes()).

Then the same with the other parameters.

That can shed some light; otherwise someone will need to monitor the process count and go into the processes by age and memory patterns.

On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <jerome.a...@anakeen.com> wrote:

The 2 GB consumption is from Adrien's system.

On mine, since I set up the logging of the "/_node/_local/_system" output:
- on June 14th, max memory.processes was 2.6 GB
- on June 15th, max memory.processes was 4.7 GB
- on June 16th, max memory.processes was 7.0 GB
- today (June 17th), max memory.processes was 8.0 GB (and with an interactive top I see spikes at 12 GB)

memory.processes seems to be steadily increasing over the days, and I expect the out-of-memory condition to be triggered again in a couple of days.
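Vladimir's suggestion to repeat the scan for each statistic can be folded into one parameterized helper. A sketch (the undefined branch covers processes that exit mid-scan):

    TopBy = fun(Key, N) ->
                Pairs = [case process_info(P, Key) of
                             {Key, V} -> {V, P};
                             undefined -> {0, P}  % process exited during the scan
                         end || P <- processes()],
                lists:sublist(lists:reverse(lists:sort(Pairs)), N)
            end.
    TopBy(total_heap_size, 10).
    TopBy(message_queue_len, 10).
    TopBy(reductions, 10).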
On Mon, Jun 17, 2019 at 11:53, Vladimir Ralev <vladimir.ra...@gmail.com> wrote:

Nothing to see here; the message queue stat from Adam's advice is accurate. Note that you should run this only when there is already an unreasonable amount of memory leaked/consumed.

But now I realise you had "processes":1877591424 before the restart in the stats above, which is less than 2 GB. Are you using only 2 gigs of RAM? I got confused by the initial comment and thought you had 15 GB of RAM. If you are only using 2 gigs of RAM, it's probably not enough for your workload.

On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé <jerome.a...@anakeen.com> wrote:

That command seems to work, and here is the output:

--8<--
# /opt/couchdb/bin/remsh < debug.2.remsh
Eshell V7.3 (abort with ^G)
(remsh22574@127.0.0.1)1> [{0,<0.0.0>},
 {0,<0.3.0>},
 {0,<0.6.0>},
 {0,<0.7.0>},
 {0,<0.9.0>},
 {0,<0.10.0>},
 {0,<0.11.0>},
 {0,<0.12.0>},
 {0,<0.14.0>},
 {0,<0.15.0>},
 {0,<0.16.0>},
 {0,<0.17.0>},
 {0,<0.18.0>},
 {0,<0.19.0>},
 {0,<0.20.0>},
 {0,<0.21.0>},
 {0,<0.22.0>},
 {0,<0.23.0>},
 {0,<0.24.0>},
 {0,<0.25.0>},
 {0,<0.26.0>},
 {0,<0.27.0>},
 {0,<0.28.0>},
 {0,<0.29.0>},
 {0,<0.31.0>},
 {0,<0.32.0>},
 {0,<0.33.0>},
 {0,...},
 {...}]
(remsh22574@127.0.0.1)2> {0,<0.38.0>}
(remsh22574@127.0.0.1)3> [{current_function,{erl_eval,do_apply,6}},
 {initial_call,{erlang,apply,2}},
 {status,running},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.32.0>]},
 {dictionary,[]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.31.0>},
 {total_heap_size,5172},
 {heap_size,2586},
 {stack_size,24},
 {reductions,24496},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,1}]},
 {suspending,[]}]
(remsh22574@127.0.0.1)4> *** Terminating erlang ('remsh22574@127.0.0.1')
-->8--

What should I be looking for in this output?
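For reference, the usual tells are message_queue_len and total_heap_size; the process shown above (empty queue, ~5k-word heap) looks unremarkable, which matches Vladimir's "nothing to see here" above. Any suspicious pid from a ranked list can be inspected in one call; a sketch (pid/3 is an Erlang shell built-in, and the pid here is just an example):

    process_info(pid(0,38,0),
                 [registered_name, current_function, message_queue_len,
                  total_heap_size, memory]).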
On Fri, Jun 14, 2019 at 17:30, Vladimir Ralev <vladimir.ra...@gmail.com> wrote:

That means your couch is creating and destroying processes too rapidly. I haven't seen this; however, I think Adam's message_queues stat above does the same thing. I didn't notice you can get it from there.

Either way, it will be useful if you can get the shell to work. Try this command instead for the first one; the rest will be the same:

    MQSizes2 = lists:map(fun(A) ->
                             {_, B} = case process_info(A, message_queue_len) of
                                          {XV, XB} -> {XV, XB};
                                          _ERR -> io:format("~p", [_ERR]), {ok, 0}
                                      end,
                             {B, A}
                         end, processes()).

On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé <jerome.a...@anakeen.com> wrote:

I tried the following, but it seems to fail on the first command:

--8<--
# /opt/couchdb/bin/remsh
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3 (abort with ^G)
(couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} = process_info(A,message_queue_len), {B,A} end, processes()).
** exception error: no match of right hand side value undefined
-->8--

On Fri, Jun 14, 2019 at 16:08, Vladimir Ralev <vladimir.ra...@gmail.com> wrote:

Hey guys, I bet it's a mailbox leaking memory. I am very interested in debugging issues like this too.

I can suggest getting an Erlang shell and running these commands to see the top memory-consuming processes:
https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html

One issue I will be reporting soon: if one of your nodes is down for some amount of time, it seems like all databases independently try and retry to query the missing node and fail, resulting in a lot of logs printed for each db, which can overwhelm the logger process. If you have a lot of DBs this makes the problem worse, but it doesn't happen right away for some reason.
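If that log-flood theory applies, the logger's mailbox should show it. A quick probe from a remsh (error_logger is the registered logger process on OTP 18; CouchDB's own couch_log processes could be checked the same way):

    process_info(whereis(error_logger), message_queue_len).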
On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <adrien.ve...@tolteck.com> wrote:

Hi Jérôme and Adam,

That's funny, because I'm investigating the exact same problem these days. We have two CouchDB setups:
- a one-node server (q=2 n=1) with 5000 databases
- a 3-node cluster (q=2 n=3) with 50000 databases

... and we are experiencing the problem on both setups. We've been having this problem for at least 3-4 months.

We've monitored:

- The number of open files: it's relatively low (both the system's total and the fds opened by beam.smp).
  https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png

- The usage of RAM, total used and used by beam.smp:
  https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
  It continuously grows, with regular spikes, until CouchDB is killed by an OOM. After restart, the RAM usage is nice and low, with no spikes.

- /_node/_local/_system metrics, before and after restart. Values that differ significantly (before vs. after restart) are listed here:
  - uptime (obviously ;-))
  - memory.processes: +3732 %
  - memory.processes_used: +3735 %
  - memory.binary: +17700 %
  - context_switches: +17376 %
  - reductions: +867832 %
  - garbage_collection_count: +448248 %
  - words_reclaimed: +112755 %
  - io_input: +44226 %
  - io_output: +157951 %

Before CouchDB restart:

    {
      "uptime": 2712973,
      "memory": {
        "other": 7250289,
        "atom": 512625,
        "atom_used": 510002,
        "processes": 1877591424,
        "processes_used": 1877504920,
        "binary": 177468848,
        "code": 9653286,
        "ets": 16012736
      },
      "run_queue": 0,
      "ets_table_count": 102,
      "context_switches": 1621495509,
      "reductions": 968705947589,
      "garbage_collection_count": 331826928,
      "words_reclaimed": 269964293572,
      "io_input": 8812455,
      "io_output": 20733066,
      ...

After CouchDB restart:

    {
      "uptime": 206,
      "memory": {
        "other": 6907493,
        "atom": 512625,
        "atom_used": 497769,
        "processes": 49001944,
        "processes_used": 48963168,
        "binary": 997032,
        "code": 9233842,
        "ets": 4779576
      },
      "run_queue": 0,
      "ets_table_count": 102,
      "context_switches": 1015486,
      "reductions": 111610788,
      "garbage_collection_count": 74011,
      "words_reclaimed": 239214127,
      "io_input": 19881,
      "io_output": 13118,
      ...

Adrien
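The memory.binary jump (+17700 %) is worth isolating: off-heap binaries are only released when the processes referencing them garbage-collect, so a forced full sweep shows how much of that is merely uncollected rather than genuinely referenced. A sketch (expensive on a busy node, so use with care):

    Before = erlang:memory(binary).
    [erlang:garbage_collect(P) || P <- processes()].
    After = erlang:memory(binary).
    {Before, After}.

If After drops dramatically, the binaries were held by processes that rarely collect; if not, something is still referencing them.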
On Fri, Jun 14, 2019 at 15:11, Jérôme Augé <jerome.a...@anakeen.com> wrote:

OK, so I'll set up a cron job to journalize (every minute?) the output from "/_node/_local/_system" and wait for the next OOM kill.

Any property from "_system" to look for in particular?

Here is a link to the memory usage graph:
https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png

The memory usage varies, but the general trend is upward, with some regularity, over a week until we reach OOM. When "beam.smp" is killed, it's reported as consuming 15 GB (as seen in the kernel's OOM trace in syslog).

Thanks,
Jérôme

On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski <kocol...@apache.org> wrote:

Hi Jérôme,

Thanks for a well-written and detailed report (though the mailing list strips attachments). The _system endpoint provides a lot of useful data for debugging these kinds of situations; do you have a snapshot of the output from when the system was consuming a lot of memory?

http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system

Adam
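Alongside the cron-plus-_system approach, the same minute-by-minute sampling can be done from a long-lived remsh. A sketch of an in-VM sampler (named funs need OTP 17+, so the Eshell V7.3 / OTP 18 shown above is fine):

    Sample = fun Loop() ->
                 io:format("~p ~p run_queue=~p procs=~p~n",
                           [calendar:local_time(), erlang:memory(),
                            erlang:statistics(run_queue),
                            erlang:system_info(process_count)]),
                 timer:sleep(60000),
                 Loop()
             end.
    spawn(Sample).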
On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com> wrote:

Hi,

I'm having a hard time figuring out the high memory usage of a CouchDB server.

What I'm observing is that the memory consumption of the "beam.smp" process gradually rises until it triggers the kernel's OOM (out-of-memory) killer, which kills the "beam.smp" process.

It also seems that many databases are not compacted: I've made a script that iterates over the databases to compute the fragmentation factor, and it seems I have around 2100 databases with a frag > 70%.

We have a single CouchDB v2.1.1 server (configured with q=8 n=1) and around 2770 databases.

The server initially had 4 GB of RAM, and we are now at 16 GB with 8 vCPUs, and it still regularly reaches OOM. From the monitoring I see that with 16 GB the OOM is triggered almost once per week (c.f. attached graph).

The memory usage seems to increase gradually until it reaches OOM.

The Couch server is mostly used by web clients with the PouchDB JS API.

We have ~1300 distinct users, and by monitoring the established TCP connections (netstat) I guess we have around 100 users (maximum) at any given time. From what I understand of the application's logic, each user accesses 2 private databases (read/write) + 1 common database (read-only).

On-disk usage of CouchDB's data directory is around 40 GB.

Any ideas on what could cause such behavior (increasing memory usage over the course of a week)? Or how to find out what is happening behind the scenes?

Regards,
Jérôme
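For reference, the fragmentation check described above boils down to comparing a database's file size with its active data size. A hypothetical sketch runnable from a remsh on the node (jiffy ships with CouchDB, so it should be on the code path; the URL, credentials, and database name are placeholders):

    %% frag = (file bytes - active bytes) / file bytes
    application:ensure_all_started(inets).
    {ok, {{_, 200, _}, _, Body}} = httpc:request("http://admin:secret@127.0.0.1:5984/some_db").
    {Props} = jiffy:decode(list_to_binary(Body)).
    {Sizes} = proplists:get_value(<<"sizes">>, Props).
    Frag = (proplists:get_value(<<"file">>, Sizes)
            - proplists:get_value(<<"active">>, Sizes))
           / proplists:get_value(<<"file">>, Sizes).

Running that over all entries from /_all_dbs and compacting anything above a chosen threshold is what the script presumably automates.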