I don't have an "erl_crash.dump" either, but I guess that's because the erlang
process is hard-killed by the kernel, so it does not have time to dump
its state...

@Adrien: what version of CouchDB are you using?


On Thu, Jun 27, 2019 at 12:27 PM Adrien Vergé <adrien.ve...@tolteck.com>
wrote:

> Vladimir, I would love to have your debugging skills and feedback on this,
> thanks for proposing! Unfortunately this only happens on real servers, after
> weeks of continuous real-life requests. In the past, we tried to reproduce it
> on test servers, but the memory leak doesn't happen if CouchDB is not very
> active (or it does happen, but too slowly to be noticeable). And these
> servers contain protected data that our rules don't allow us to share.
>
> I also searched for the crash dump (sudo find / -name '*.dump'; sudo find /
> -name 'erl_crash*') but couldn't find it; do you know where it could be located?
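
A note on that: by default the Erlang VM writes erl_crash.dump into the node's
working directory, and the location can be pinned with the ERL_CRASH_DUMP
environment variable. A minimal sketch, assuming CouchDB runs under systemd as
a "couchdb" unit (the unit name and path are assumptions, adjust to your setup):

    sudo systemctl edit couchdb
    # add to the override file:
    #   [Service]
    #   Environment=ERL_CRASH_DUMP=/var/log/couchdb/erl_crash.dump
    sudo systemctl restart couchdb

With that in place, the next dump (if the VM gets far enough to write one)
lands in a predictable location.
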
>
> We already have swap on these machines. Next time the system comes close to
> the OOM point, I will try to see whether they use swap or not.
>
> On Wed, Jun 26, 2019 at 12:43 PM Vladimir Ralev <vladimir.ra...@gmail.com>
> wrote:
>
> > Ouch. I have an idea: can you add a bunch of swap on one of those machines,
> > say 20 gigs? This should allow the machine to work for a little longer in
> > slow mode instead of running out of memory, which will buy you time to run
> > more diagnostics after the incident occurs. This will probably slow the
> > response times down a lot though and might break your apps.
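
(For reference, a minimal sketch of adding a 20 GB swap file on Linux; the
file path and size here are just an example, not something from this thread:

    sudo fallocate -l 20G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # make it persistent across reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Remember to remove it again once the diagnostics are done.)
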
> >
> > Also can you upload that erl_crash.dump file that the crash generated?
> >
> > PS I would love to get a shell access to a system like that, if you can
> > reproduce the issue on a test machine and give me access I should be able
> > to come up with something. Free of charge.
> >
> > On Wed, Jun 26, 2019 at 1:17 PM Adrien Vergé <adrien.ve...@tolteck.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > Here is more feedback, since one of our CouchDB servers crashed last
> > night.
> > >
> > > - Setup: CouchDB 2.3.1 on a 3-node cluster (q=2 n=3) with ~50k small
> > > databases.
> > >
> > > - Only one of the 3 nodes crashed. Others should crash in a few days
> > >   (they usually crash and restart every ~3 weeks).
> > >
> > > - update_lru_on_read = false
> > >
> > > - The extra memory consumption comes from the beam.smp process (see
> > >   graph below).
> > >
> > > - The crash is an OOM, see the last log lines before restart:
> > >
> > >       eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
> > >       Crash dump is being written to: erl_crash.dump...
> > >       [os_mon] memory supervisor port (memsup): Erlang has closed
> > >       [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
> > >
> > > - Over the last weeks, beam.smp memory usage kept increasing and
> > >   increasing. See the graph I made at
> > >   https://framapic.org/jnHAyVEKq98k/kXCQv3pyUdz0.png
> > >
> > > - /_node/_local/_system metrics look normal. The difference between an
> > >   "about to crash" node and a "freshly restarted and lots of free RAM"
> > >   node is in uptime, memory.processes_used, memory.binary,
> > >   context_switches, reductions, garbage_collection_count, io_input...
> > >   as previously discussed with Adam.
> > >
> > > - This command gives exactly the same output on an "about to crash" node
> > >   as on a "freshly restarted and lots of free RAM" node:
> > >
> > >       MQSizes2 = lists:map(fun(A) -> {_,B} = case process_info(A,total_heap_size)
> > >       of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end, {B,A}
> > >       end, processes()).
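
(About the /_node/_local/_system comparison two bullets up: here is a minimal
snapshot-diff sketch. It assumes the node answers on 127.0.0.1:5984 and that
admin credentials are available; adjust the URL, credentials and file names:

    # capture a snapshot on each node / at each point in time
    curl -s http://admin:password@127.0.0.1:5984/_node/_local/_system \
      > system-$(hostname)-$(date +%F-%H%M).json

    # compare two snapshots field by field
    diff <(jq -S . system-node1-before.json) <(jq -S . system-node1-after.json)

jq -S sorts the keys, so the diff only shows values that actually changed.)
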
> > >
> > > On Thu, Jun 20, 2019 at 5:08 PM Jérôme Augé <jerome.a...@anakeen.com>
> > > wrote:
> > >
> > > > We are going to plan an upgrade from 2.1.1 to 2.3.1 in the coming weeks.
> > > >
> > > > I have a side question concerning CouchDB's upgrades: is the database
> > > > binary compatible between v2.1.1 and v2.3.1? In case we ever need to
> > > > downgrade back to 2.1.1, can the binary data be kept?
> > > >
> > > > Regards,
> > > > Jérôme
> > > >
> > > > On Wed, Jun 19, 2019 at 8:59 AM Jérôme Augé <jerome.a...@anakeen.com>
> > > > wrote:
> > > >
> > > > > Thanks Adam for your explanations!
> > > > >
> > > > > The "update_lru_on_read" setting is already set to false on this
> > > > > instance (I had already seen the comments on these pull requests).
> > > > >
> > > > > We are indeed running an "old" 2.1.1 version, and we have advised the
> > > > > client that an upgrade might be needed to sort out (or further
> > > > > investigate) these problems.
> > > > >
> > > > > Thanks again,
> > > > > Jérôme
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 18, 2019 at 6:59 PM Adam Kocoloski <kocol...@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Hi Jérôme, definitely useful.
> > > > >>
> > > > >> The “run_queue” is the number of Erlang processes in a runnable
> > > > >> state that are not currently executing on a scheduler. When that
> > > > >> value is greater than zero it means the node is hitting some compute
> > > > >> limitations. Seeing a small positive value from time to time is no
> > > > >> problem.
> > > > >>
> > > > >> Your last six snapshots show a message queue backlog in
> > > > >> couch_server. That could be what caused the node to OOM. The
> > > > >> couch_server process is a singleton and if it accumulates a large
> > > > >> message backlog there are limited backpressure or scaling mechanisms
> > > > >> to help it recover. I noticed you’re running 2.1.1; there were a
> > > > >> couple of important enhancements to reduce the message flow through
> > > > >> couch_server in more recent releases:
> > > > >>
> > > > >> 2.2.0: https://github.com/apache/couchdb/pull/1118
> > > > >> 2.3.1: https://github.com/apache/couchdb/pull/1593
> > > > >>
> > > > >> The change in 2.2.0 is just a change in the default configuration;
> > > > >> you can try applying it to your server by setting:
> > > > >>
> > > > >> [couchdb]
> > > > >> update_lru_on_read = false
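
(Side note: if I'm not mistaken, this setting can also be flipped at runtime
through the node configuration API, without a restart. A sketch, with
placeholder credentials and the default port:

    curl -s -X PUT \
      http://admin:password@127.0.0.1:5984/_node/_local/_config/couchdb/update_lru_on_read \
      -H 'Content-Type: application/json' -d '"false"'

The call should return the previous value of the key.)
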
> > > > >>
> > > > >> The changes in 2.3.1 offer additional benefits for couch_server
> > > > >> message throughput but you’ll need to upgrade to get them.
> > > > >>
> > > > >> Cheers, Adam
> > > > >>
> > > > >> P.S. I don’t know what’s going on with the negative memory.other
> > > > >> value there, it’s not intentionally meaningful :)
> > > > >>
> > > > >>
> > > > >> > On Jun 18, 2019, at 11:30 AM, Jérôme Augé <jerome.a...@anakeen.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > "beam.smp" just got killed by OOM, but I was not in front of the
> > > > >> > machine to perform this command...
> > > > >> >
> > > > >> > However, here is the CouchDB log of "/_node/_local/_system" for
> > > > >> > the 30 minutes preceding the OOM:
> > > > >> > -
> > > > >> > https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log
> > > > >> >
> > > > >> > I guess the spike that triggers the OOM is so quick (< 1 min) that
> > > > >> > it does not get logged (I log every minute).
> > > > >> >
> > > > >> > Is there anything that can be used/deduced from the last line
> > > > >> > logged at 2019-06-18T16:00:14+0200?
> > > > >> >
> > > > >> > At 15:55:25, the "run_queue" is at 36: what does it mean? The
> > > > >> > number of active concurrent requests?
> > > > >> >
> > > > >> > From 15:56 to 16:00 the "memory"."other" value is negative: does
> > > > >> > it mean something special, or is it just an integer overflow?
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Mon, Jun 17, 2019 at 2:09 PM Vladimir Ralev
> > > > >> > <vladimir.ra...@gmail.com> wrote:
> > > > >> >
> > > > >> >> Alright, I think the issue will be more visible towards the OOM
> > > > >> >> point. However, for now, since you have the system live with a
> > > > >> >> leak, it will be useful to repeat the same steps, but replace
> > > > >> >> "message_queue_len" with "total_heap_size", then with "heap_size",
> > > > >> >> then with "stack_size", and then with "reductions".
> > > > >> >>
> > > > >> >> For example:
> > > > >> >>
> > > > >> >> MQSizes2 = lists:map(fun(A) -> {_,B} = case process_info(A,total_heap_size)
> > > > >> >> of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end, {B,A}
> > > > >> >> end, processes()).
> > > > >> >>
> > > > >> >> Then same with the other params.
> > > > >> >>
> > > > >> >> That can shed some light; otherwise someone will need to monitor
> > > > >> >> the process count and go into the processes by age and memory
> > > > >> >> patterns.
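
(A small helper for that kind of inspection: the sort can be done in the same
remsh session. A sketch, assuming the remsh wrapper accepts commands on stdin
as in the transcript further down; swap total_heap_size for any of the other keys:

    echo 'lists:sublist(lists:reverse(lists:sort([{case process_info(P, total_heap_size) of {_,V} -> V; _ -> 0 end, P} || P <- processes()])), 10).' \
      | /opt/couchdb/bin/remsh

This prints the ten processes with the largest heaps, biggest first.)
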
> > > > >> >>
> > > > >> >> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <jerome.a...@anakeen.com>
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >>> The 2G consumption is from Adrien's system.
> > > > >> >>>
> > > > >> >>> On mine, since I set up the logging of "/_node/_local/_system"
> > > > >> >>> output:
> > > > >> >>> - on June 14th max memory.processes was 2.6 GB
> > > > >> >>> - on June 15th max memory.processes was 4.7 GB
> > > > >> >>> - on June 16th max memory.processes was 7.0 GB
> > > > >> >>> - today (June 17th) max memory.processes was 8.0 GB (and with an
> > > > >> >>>   interactive top I see spikes at 12 GB)
> > > > >> >>>
> > > > >> >>> The memory.processes seems to be steadily increasing over the
> > > > >> >>> days, and I'm soon expecting the out-of-memory condition to be
> > > > >> >>> triggered in a couple of days.
> > > > >> >>>
> > > > >> >>> On Mon, Jun 17, 2019 at 11:53 AM Vladimir Ralev
> > > > >> >>> <vladimir.ra...@gmail.com> wrote:
> > > > >> >>>
> > > > >> >>>> Nothing to see here, the message queue stat from Adam's advice
> > > > >> >>>> is accurate. Note that you should run this only when there is
> > > > >> >>>> already an unreasonable amount of memory leaked/consumed.
> > > > >> >>>>
> > > > >> >>>> But now I realise you had "processes":1877591424 before restart
> > > > >> >>>> in the stats above, which is less than 2 GB. Are you using only
> > > > >> >>>> 2 gigs of RAM? I got confused by the initial comment and I
> > > > >> >>>> thought you had 15 GB RAM. If you are only using 2 gigs of RAM,
> > > > >> >>>> it's probably not enough for your workload.
> > > > >> >>>>
> > > > >> >>>> On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé <jerome.a...@anakeen.com>
> > > > >> >>>> wrote:
> > > > >> >>>>
> > > > >> >>>>> That command seems to work, and here is the output:
> > > > >> >>>>>
> > > > >> >>>>> --8<--
> > > > >> >>>>> # /opt/couchdb/bin/remsh < debug.2.remsh
> > > > >> >>>>> Eshell V7.3  (abort with ^G)
> > > > >> >>>>> (remsh22574@127.0.0.1)1> [{0,<0.0.0>},
> > > > >> >>>>> {0,<0.3.0>},
> > > > >> >>>>> {0,<0.6.0>},
> > > > >> >>>>> {0,<0.7.0>},
> > > > >> >>>>> {0,<0.9.0>},
> > > > >> >>>>> {0,<0.10.0>},
> > > > >> >>>>> {0,<0.11.0>},
> > > > >> >>>>> {0,<0.12.0>},
> > > > >> >>>>> {0,<0.14.0>},
> > > > >> >>>>> {0,<0.15.0>},
> > > > >> >>>>> {0,<0.16.0>},
> > > > >> >>>>> {0,<0.17.0>},
> > > > >> >>>>> {0,<0.18.0>},
> > > > >> >>>>> {0,<0.19.0>},
> > > > >> >>>>> {0,<0.20.0>},
> > > > >> >>>>> {0,<0.21.0>},
> > > > >> >>>>> {0,<0.22.0>},
> > > > >> >>>>> {0,<0.23.0>},
> > > > >> >>>>> {0,<0.24.0>},
> > > > >> >>>>> {0,<0.25.0>},
> > > > >> >>>>> {0,<0.26.0>},
> > > > >> >>>>> {0,<0.27.0>},
> > > > >> >>>>> {0,<0.28.0>},
> > > > >> >>>>> {0,<0.29.0>},
> > > > >> >>>>> {0,<0.31.0>},
> > > > >> >>>>> {0,<0.32.0>},
> > > > >> >>>>> {0,<0.33.0>},
> > > > >> >>>>> {0,...},
> > > > >> >>>>> {...}]
> > > > >> >>>>> (remsh22574@127.0.0.1)2> {0,<0.38.0>}
> > > > >> >>>>> (remsh22574@127.0.0.1)3>
> > > > [{current_function,{erl_eval,do_apply,6}},
> > > > >> >>>>> {initial_call,{erlang,apply,2}},
> > > > >> >>>>> {status,running},
> > > > >> >>>>> {message_queue_len,0},
> > > > >> >>>>> {messages,[]},
> > > > >> >>>>> {links,[<0.32.0>]},
> > > > >> >>>>> {dictionary,[]},
> > > > >> >>>>> {trap_exit,false},
> > > > >> >>>>> {error_handler,error_handler},
> > > > >> >>>>> {priority,normal},
> > > > >> >>>>> {group_leader,<0.31.0>},
> > > > >> >>>>> {total_heap_size,5172},
> > > > >> >>>>> {heap_size,2586},
> > > > >> >>>>> {stack_size,24},
> > > > >> >>>>> {reductions,24496},
> > > > >> >>>>> {garbage_collection,[{min_bin_vheap_size,46422},
> > > > >> >>>>>                      {min_heap_size,233},
> > > > >> >>>>>                      {fullsweep_after,65535},
> > > > >> >>>>>                      {minor_gcs,1}]},
> > > > >> >>>>> {suspending,[]}]
> > > > >> >>>>> (remsh22574@127.0.0.1)4> *** Terminating erlang ('
> > > > >> >> remsh22574@127.0.0.1
> > > > >> >>> ')
> > > > >> >>>>> -->8--
> > > > >> >>>>>
> > > > >> >>>>> What should I be looking for in this output?
> > > > >> >>>>>
> > > > >> >>>>> On Fri, Jun 14, 2019 at 5:30 PM Vladimir Ralev
> > > > >> >>>>> <vladimir.ra...@gmail.com> wrote:
> > > > >> >>>>>
> > > > >> >>>>>> That means your couch is creating and destroying processes
> > > > >> >>>>>> too rapidly. I haven't seen this, however I think Adam's
> > > > >> >>>>>> message_queues stat above does the same thing. I didn't
> > > > >> >>>>>> notice you can get it from there.
> > > > >> >>>>>>
> > > > >> >>>>>> Either way it will be useful if you can get the shell to work.
> > > > >> >>>>>> Try this command instead for the first one; the rest will be
> > > > >> >>>>>> the same:
> > > > >> >>>>>>
> > > > >> >>>>>> MQSizes2 = lists:map(fun(A) -> {_,B} = case
> > > > >> >>>>>> process_info(A,message_queue_len) of {XV,XB} -> {XV, XB}; _ERR ->
> > > > >> >>>>>> io:format("~p",[_ERR]),{ok, 0} end, {B,A} end, processes()).
> > > > >> >>>>>>
> > > > >> >>>>>> On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé <jerome.a...@anakeen.com>
> > > > >> >>>>>> wrote:
> > > > >> >>>>>>
> > > > >> >>>>>>> I tried the following, but it seems to fail on the first
> > > > >> >>>>>>> command:
> > > > >> >>>>>>>
> > > > >> >>>>>>> --8<--
> > > > >> >>>>>>> # /opt/couchdb/bin/remsh
> > > > >> >>>>>>> Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
> > > > >> >>>>>>> [async-threads:10] [hipe] [kernel-poll:false]
> > > > >> >>>>>>>
> > > > >> >>>>>>> Eshell V7.3  (abort with ^G)
> > > > >> >>>>>>> (couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} =
> > > > >> >>>>>>> process_info(A,message_queue_len), {B,A} end, processes()).
> > > > >> >>>>>>> ** exception error: no match of right hand side value undefined
> > > > >> >>>>>>> -->8--
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> On Fri, Jun 14, 2019 at 4:08 PM Vladimir Ralev
> > > > >> >>>>>>> <vladimir.ra...@gmail.com> wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>>> Hey guys. I bet it's a mailbox leaking memory. I am very
> > > > >> >>>>>>>> interested in debugging issues like this too.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> I can suggest getting an erlang shell and running these
> > > > >> >>>>>>>> commands to see the top memory-consuming processes:
> > > > >> >>>>>>>> https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> One issue I will be reporting soon: if one of your nodes is
> > > > >> >>>>>>>> down for some amount of time, it seems like all databases
> > > > >> >>>>>>>> independently try and retry to query the missing node and
> > > > >> >>>>>>>> fail, resulting in a lot of logs printed for each db, which
> > > > >> >>>>>>>> can overwhelm the logger process. If you have a lot of DBs
> > > > >> >>>>>>>> this makes the problem worse, but it doesn't happen right
> > > > >> >>>>>>>> away for some reason.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <adrien.ve...@tolteck.com>
> > > > >> >>>>>>>> wrote:
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>> Hi Jérôme and Adam,
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> That's funny, because I'm investigating the exact same
> > > > >> >>>>>>>>> problem these days. We have two CouchDB setups:
> > > > >> >>>>>>>>> - a one-node server (q=2 n=1) with 5000 databases
> > > > >> >>>>>>>>> - a 3-node cluster (q=2 n=3) with 50000 databases
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> ... and we are experiencing the problem on both setups.
> > > > >> >>>>>>>>> We've been having this problem for at least 3-4 months.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> We've monitored:
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> - The number of open files: it's relatively low (both the
> > > > >> >>>>>>>>>   system's total and the fds opened by beam.smp).
> > > > >> >>>>>>>>>   https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> - The usage of RAM, total used and used by beam.smp:
> > > > >> >>>>>>>>>   https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
> > > > >> >>>>>>>>>   It continuously grows, with regular spikes, until killing
> > > > >> >>>>>>>>>   CouchDB with an OOM. After restart, the RAM usage is nice
> > > > >> >>>>>>>>>   and low, and no spikes.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> - /_node/_local/_system metrics, before and after restart.
> > > > >> >>>>>>>>>   Values that significantly differ (before / after restart)
> > > > >> >>>>>>>>>   are listed here:
> > > > >> >>>>>>>>>  - uptime (obviously ;-))
> > > > >> >>>>>>>>>  - memory.processes : + 3732 %
> > > > >> >>>>>>>>>  - memory.processes_used : + 3735 %
> > > > >> >>>>>>>>>  - memory.binary : + 17700 %
> > > > >> >>>>>>>>>  - context_switches : + 17376 %
> > > > >> >>>>>>>>>  - reductions : + 867832 %
> > > > >> >>>>>>>>>  - garbage_collection_count : + 448248 %
> > > > >> >>>>>>>>>  - words_reclaimed : + 112755 %
> > > > >> >>>>>>>>>  - io_input : + 44226 %
> > > > >> >>>>>>>>>  - io_output : + 157951 %
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Before CouchDB restart:
> > > > >> >>>>>>>>> {
> > > > >> >>>>>>>>>  "uptime":2712973,
> > > > >> >>>>>>>>>  "memory":{
> > > > >> >>>>>>>>>    "other":7250289,
> > > > >> >>>>>>>>>    "atom":512625,
> > > > >> >>>>>>>>>    "atom_used":510002,
> > > > >> >>>>>>>>>    "processes":1877591424,
> > > > >> >>>>>>>>>    "processes_used":1877504920,
> > > > >> >>>>>>>>>    "binary":177468848,
> > > > >> >>>>>>>>>    "code":9653286,
> > > > >> >>>>>>>>>    "ets":16012736
> > > > >> >>>>>>>>>  },
> > > > >> >>>>>>>>>  "run_queue":0,
> > > > >> >>>>>>>>>  "ets_table_count":102,
> > > > >> >>>>>>>>>  "context_switches":1621495509,
> > > > >> >>>>>>>>>  "reductions":968705947589,
> > > > >> >>>>>>>>>  "garbage_collection_count":331826928,
> > > > >> >>>>>>>>>  "words_reclaimed":269964293572,
> > > > >> >>>>>>>>>  "io_input":8812455,
> > > > >> >>>>>>>>>  "io_output":20733066,
> > > > >> >>>>>>>>>  ...
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> After CouchDB restart:
> > > > >> >>>>>>>>> {
> > > > >> >>>>>>>>>  "uptime":206,
> > > > >> >>>>>>>>>  "memory":{
> > > > >> >>>>>>>>>    "other":6907493,
> > > > >> >>>>>>>>>    "atom":512625,
> > > > >> >>>>>>>>>    "atom_used":497769,
> > > > >> >>>>>>>>>    "processes":49001944,
> > > > >> >>>>>>>>>    "processes_used":48963168,
> > > > >> >>>>>>>>>    "binary":997032,
> > > > >> >>>>>>>>>    "code":9233842,
> > > > >> >>>>>>>>>    "ets":4779576
> > > > >> >>>>>>>>>  },
> > > > >> >>>>>>>>>  "run_queue":0,
> > > > >> >>>>>>>>>  "ets_table_count":102,
> > > > >> >>>>>>>>>  "context_switches":1015486,
> > > > >> >>>>>>>>>  "reductions":111610788,
> > > > >> >>>>>>>>>  "garbage_collection_count":74011,
> > > > >> >>>>>>>>>  "words_reclaimed":239214127,
> > > > >> >>>>>>>>>  "io_input":19881,
> > > > >> >>>>>>>>>  "io_output":13118,
> > > > >> >>>>>>>>>  ...
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Adrien
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> On Fri, Jun 14, 2019 at 3:11 PM Jérôme Augé
> > > > >> >>>>>>>>> <jerome.a...@anakeen.com> wrote:
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>> Ok, so I'll set up a cron job to journalize (every minute?)
> > > > >> >>>>>>>>>> the output from "/_node/_local/_system" and wait for the
> > > > >> >>>>>>>>>> next OOM kill.
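
(A minimal sketch of such a cron entry, with placeholder credentials and log
path; it assumes curl is available and the node answers on localhost:5984:

    * * * * * (date -Is; curl -s http://admin:password@127.0.0.1:5984/_node/_local/_system; echo) >> /var/log/couchdb-system.log

Prefixing each sample with a timestamp makes it easier to line the values up
with the kernel's OOM trace in syslog afterwards.)
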
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Any property from "_system" to look for in particular?
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Here is a link to the memory usage graph:
> > > > >> >>>>>>>>>> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> The memory usage varies, but the general trend is to go up
> > > > >> >>>>>>>>>> with some regularity over a week until we reach OOM. When
> > > > >> >>>>>>>>>> "beam.smp" is killed, it's reported as consuming 15 GB (as
> > > > >> >>>>>>>>>> seen in the kernel's OOM trace in syslog).
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Thanks,
> > > > >> >>>>>>>>>> Jérôme
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> On Fri, Jun 14, 2019 at 1:48 PM Adam Kocoloski
> > > > >> >>>>>>>>>> <kocol...@apache.org> wrote:
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>>> Hi Jérôme,
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> Thanks for a well-written and detailed report (though the
> > > > >> >>>>>>>>>>> mailing list strips attachments). The _system endpoint
> > > > >> >>>>>>>>>>> provides a lot of useful data for debugging these kinds of
> > > > >> >>>>>>>>>>> situations; do you have a snapshot of the output when the
> > > > >> >>>>>>>>>>> system was consuming a lot of memory?
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> Adam
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com>
> > > > >> >>>>>>>>>>>> wrote:
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> Hi,
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> I'm having a hard time figuring out the high memory usage
> > > > >> >>>>>>>>>>>> of a CouchDB server.
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> What I'm observing is that the memory consumption from the
> > > > >> >>>>>>>>>>>> "beam.smp" process gradually rises until it triggers the
> > > > >> >>>>>>>>>>>> kernel's OOM (Out-Of-Memory) killer, which kills the
> > > > >> >>>>>>>>>>>> "beam.smp" process.
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> It also seems that many databases are not compacted: I've
> > > > >> >>>>>>>>>>>> made a script to iterate over the databases to compute the
> > > > >> >>>>>>>>>>>> fragmentation factor, and it seems I have around 2100
> > > > >> >>>>>>>>>>>> databases with a frag > 70%.
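
(For anyone wanting to reproduce that kind of check, a rough sketch of a
fragmentation report using curl and jq; the credentials are placeholders, it
assumes the 2.x "sizes" object in the database info, and database names with
special characters would need URL-encoding:

    HOST='http://admin:password@127.0.0.1:5984'
    for db in $(curl -s "$HOST/_all_dbs" | jq -r '.[]'); do
      curl -s "$HOST/$db" | jq -r --arg db "$db" \
        '"\($db): \((1 - .sizes.active / .sizes.file) * 100 | floor)% fragmentation"'
    done

Databases above ~70% are good candidates for compaction.)
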
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> We have a single CouchDB v2.1.1 server (configured with
> > > > >> >>>>>>>>>>>> q=8 n=1) and around 2770 databases.
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> The server initially had 4 GB of RAM, and we are now at
> > > > >> >>>>>>>>>>>> 16 GB w/ 8 vCPU, and it still regularly reaches OOM. From
> > > > >> >>>>>>>>>>>> the monitoring I see that with 16 GB the OOM is triggered
> > > > >> >>>>>>>>>>>> almost once per week (cf. attached graph).
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> The memory usage seems to increase gradually until it
> > > > >> >>>>>>>>>>>> reaches OOM.
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> The Couch server is mostly used by web clients with the
> > > > >> >>>>>>>>>>>> PouchDB JS API.
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> We have ~1300 distinct users, and by monitoring the
> > > > >> >>>>>>>>>>>> netstat/TCP established connections I guess we have around
> > > > >> >>>>>>>>>>>> 100 users (maximum) at any given time. From what I
> > > > >> >>>>>>>>>>>> understand of the application's logic, each user accesses
> > > > >> >>>>>>>>>>>> 2 private databases (read/write) + 1 common database
> > > > >> >>>>>>>>>>>> (read-only).
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> On-disk usage of CouchDB's data directory is around 40 GB.
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> Any ideas on what could cause such behavior (increasing
> > > > >> >>>>>>>>>>>> memory usage over the course of a week)? Or how to find
> > > > >> >>>>>>>>>>>> what is happening behind the scenes?
> > > > >> >>>>>>>>>>>>
> > > > >> >>>>>>>>>>>> Regards,
> > > > >> >>>>>>>>>>>> Jérôme
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > > >>
> > > >
> > >
> >
>
