On Thu, Jun 27, 2019 at 2:06 PM Jérôme Augé <[email protected]> wrote:

> I don't have an "erl_crash.dump" either, but I guess that's because the
> Erlang process is hard-killed by the kernel and does not have time to dump
> its state...
>
> @Adrien: what version of CouchDB are you using?
>

Oops: just saw you are on 2.3.1.

>
>
> On Thu, Jun 27, 2019 at 12:27 PM Adrien Vergé <[email protected]> wrote:
>
>> Vladimir, I would love to have your debugging skills and feedback on this,
>> thanks for offering! Unfortunately this only happens on real servers, after
>> weeks of continuous real-life requests. In the past we tried to reproduce
>> it on test servers, but the memory leak doesn't happen if CouchDB is not
>> very active (or it does happen, but too slowly to be noticeable). And these
>> servers contain protected data that our rules don't allow us to share.
>>
>> I also searched for the crash dump (sudo find / -name '*.dump'; sudo find /
>> -name 'erl_crash*') but couldn't find it; do you know where it could be
>> located?
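>>
>> For reference, the Erlang VM normally writes erl_crash.dump into its
>> current working directory unless the ERL_CRASH_DUMP environment variable
>> points elsewhere. A quick check against the running beam.smp process might
>> look like this (sketch; assumes a Linux host and a single beam.smp):
>>
>>     PID=$(pgrep -f beam.smp | head -n1)
>>     # where would a crash dump go? (env override, else the process's cwd)
>>     tr '\0' '\n' < /proc/$PID/environ | grep '^ERL_CRASH_DUMP' \
>>         || echo "ERL_CRASH_DUMP not set"
>>     readlink /proc/$PID/cwd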
>>
>> We already have swap on these machines. Next time the system comes close to
>> the OOM point, I will try to see whether they use swap or not.
>>
>> On Wed, Jun 26, 2019 at 12:43 PM Vladimir Ralev <[email protected]> wrote:
>>
>> > Ouch. I have an idea: can you add a bunch of swap on one of those
>> > machines, say 20 GB? This should allow the machine to keep working a
>> > little longer in slow mode instead of running out of memory, which will
>> > buy you time to run more diagnostics after the incident occurs. This will
>> > probably degrade the response times a lot, though, and might break your
>> > apps.
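>> >
>> > On a typical Linux box, a temporary swap file could be added along these
>> > lines (sketch; size and path are placeholders, and it does not persist
>> > across reboots unless added to /etc/fstab):
>> >
>> >     sudo fallocate -l 20G /swapfile   # reserve 20 GB
>> >     sudo chmod 600 /swapfile
>> >     sudo mkswap /swapfile
>> >     sudo swapon /swapfile
>> >     swapon --show                     # confirm the swap is active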
>> >
>> > Also can you upload that erl_crash.dump file that the crash generated?
>> >
>> > P.S. I would love to get shell access to a system like that. If you can
>> > reproduce the issue on a test machine and give me access, I should be
>> > able to come up with something. Free of charge.
>> >
>> > On Wed, Jun 26, 2019 at 1:17 PM Adrien Vergé <[email protected]> wrote:
>> >
>> > > Hi all,
>> > >
>> > > Here is more feedback, since one of our CouchDB servers crashed last
>> > > night.
>> > >
>> > > - Setup: CouchDB 2.3.1 on a 3-node cluster (q=2 n=3) with ~50k small
>> > >   databases.
>> > >
>> > > - Only one of the 3 nodes crashed. The others should crash in a few days
>> > >   (they usually crash and restart every ~3 weeks).
>> > >
>> > > - update_lru_on_read = false
>> > >
>> > > - The extra memory consumption comes from the beam.smp process (see
>> > >   graph below).
>> > >
>> > > - The crash is an OOM; see the last log lines before restart:
>> > >
>> > >       eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
>> > >       Crash dump is being written to: erl_crash.dump...
>> > >       [os_mon] memory supervisor port (memsup): Erlang has closed
>> > >       [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
>> > > - Over the last weeks, beam.smp memory usage kept increasing and
>> > >   increasing. See the graph I made at
>> > >   https://framapic.org/jnHAyVEKq98k/kXCQv3pyUdz0.png
>> > >
>> > > - /_node/_local/_system metrics look normal. The difference between an
>> > >   "about to crash" node and a "freshly restarted and lots of free RAM"
>> > >   node is in uptime, memory.processes_used, memory.binary,
>> > >   context_switches, reductions, garbage_collection_count, io_input... as
>> > >   previously discussed with Adam.
>> > >
>> > > - This command gives exactly the same output on an "about to crash" node
>> > >   as on a "freshly restarted and lots of free RAM" node:
>> > >
>> > >       MQSizes2 = lists:map(fun(A) ->
>> > >           {_, B} = case process_info(A, total_heap_size) of
>> > >               {XV, XB} -> {XV, XB};
>> > >               _ERR -> io:format("~p", [_ERR]), {ok, 0}
>> > >           end,
>> > >           {B, A}
>> > >       end, processes()).
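>> > >
>> > >   (To make outliers easier to spot, the same data can be sorted and
>> > >   trimmed to the top entries; a self-contained sketch, fed to remsh the
>> > >   same way as the debug.2.remsh example further down the thread:)
>> > >
>> > >       /opt/couchdb/bin/remsh <<'EOF'
>> > >       Heaps = [{H, A} || A <- processes(),
>> > >                {total_heap_size, H} <- [process_info(A, total_heap_size)]],
>> > >       lists:sublist(lists:reverse(lists:sort(Heaps)), 10).
>> > >       EOF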
>> > >
>> > > On Thu, Jun 20, 2019 at 5:08 PM Jérôme Augé <[email protected]> wrote:
>> > >
>> > > > We are going to plan an upgrade from 2.1.1 to 2.3.1 in the coming
>> > > > weeks.
>> > > >
>> > > > I have a side question concerning CouchDB upgrades: are the database
>> > > > files binary-compatible between v2.1.1 and v2.3.1? In case we ever
>> > > > need to downgrade back to 2.1.1, can the binary data be kept as-is?
>> > > >
>> > > > Regards,
>> > > > Jérôme
>> > > >
>> > > > On Wed, Jun 19, 2019 at 8:59 AM Jérôme Augé <[email protected]> wrote:
>> > > >
>> > > > > Thanks Adam for your explanations!
>> > > > >
>> > > > > The "update_lru_on_read" option is already set to false on this
>> > > > > instance (I had already seen the comments on these pull requests).
>> > > > >
>> > > > > We are indeed running an "old" 2.1.1 version, and we have advised
>> > > > > the client that an upgrade might be needed to sort out (or further
>> > > > > investigate) these problems.
>> > > > >
>> > > > > Thanks again,
>> > > > > Jérôme
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Jun 18, 2019 at 6:59 PM Adam Kocoloski <[email protected]> wrote:
>> > > > >
>> > > > >> Hi Jérôme, definitely useful.
>> > > > >>
>> > > > >> The "run_queue" is the number of Erlang processes in a runnable
>> > > > >> state that are not currently executing on a scheduler. When that
>> > > > >> value is greater than zero it means the node is hitting some
>> > > > >> compute limitations. Seeing a small positive value from time to
>> > > > >> time is no problem.
>> > > > >>
>> > > > >> Your last six snapshots show a message queue backlog in
>> > > > >> couch_server. That could be what caused the node to OOM. The
>> > > > >> couch_server process is a singleton, and if it accumulates a large
>> > > > >> message backlog there are limited backpressure or scaling
>> > > > >> mechanisms to help it recover. I noticed you're running 2.1.1;
>> > > > >> there were a couple of important enhancements to reduce the message
>> > > > >> flow through couch_server in more recent releases:
>> > > > >>
>> > > > >> 2.2.0: https://github.com/apache/couchdb/pull/1118
>> > > > >> 2.3.1: https://github.com/apache/couchdb/pull/1593
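>> > > > >>
>> > > > >> For a quick look at that particular mailbox from a remsh session,
>> > > > >> something like this sketch could work (it assumes couch_server is a
>> > > > >> locally registered process name):
>> > > > >>
>> > > > >>     /opt/couchdb/bin/remsh <<'EOF'
>> > > > >>     process_info(whereis(couch_server), message_queue_len).
>> > > > >>     EOF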
>> > > > >>
>> > > > >> The change in 2.2.0 is just a change in the default configuration;
>> > > > >> you can try applying it to your server by setting:
>> > > > >>
>> > > > >> [couchdb]
>> > > > >> update_lru_on_read = false
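>> > > > >>
>> > > > >> The same setting can also be applied at runtime through the node
>> > > > >> configuration API, roughly like this (sketch; assumes that endpoint
>> > > > >> is available on your version, and admin:password / 127.0.0.1:5984
>> > > > >> are placeholders):
>> > > > >>
>> > > > >>     curl -su admin:password -X PUT \
>> > > > >>       http://127.0.0.1:5984/_node/_local/_config/couchdb/update_lru_on_read \
>> > > > >>       -d '"false"'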
>> > > > >>
>> > > > >> The changes in 2.3.1 offer additional benefits for couch_server
>> > > > >> message throughput, but you'll need to upgrade to get them.
>> > > > >>
>> > > > >> Cheers, Adam
>> > > > >>
>> > > > >> P.S. I don't know what's going on with the negative memory.other
>> > > > >> value there; it's not intentionally meaningful :)
>> > > > >>
>> > > > >>
>> > > > >> > On Jun 18, 2019, at 11:30 AM, Jérôme Augé <[email protected]> wrote:
>> > > > >> >
>> > > > >> > "beam.smp" just got killed by OOM, but I was not in front of
>> the
>> > > > >> machine to
>> > > > >> > perform this command...
>> > > > >> >
>> > > > >> > However, here is the CouchDB log of "/_node/_local/_system" for
>> > the
>> > > 30
>> > > > >> > minutes preceding the OOM:
>> > > > >> > -
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log
>> > > > >> >
>> > > > >> > I guess the spike that triggers the OOM is so quick (< 1min)
>> that
>> > it
>> > > > >> does
>> > > > >> > not gets logged (I log every minute).
>> > > > >> >
>> > > > >> > Is there anything that can be used/deduced from the last line
>> > logged
>> > > > at
>> > > > >> > 2019-06-18T16:00:14+0200?
>> > > > >> >
>> > > > >> > At 15:55:25, the "run_queue" is at 36: what does it means?
>> Number
>> > of
>> > > > >> active
>> > > > >> > concurrent requests?
>> > > > >> >
>> > > > >> > From 15:56 to 16:00 the "memory"."other" value is a negative
>> > value:
>> > > > >> does it
>> > > > >> > means something special? or just an integer overflow?
>> > > > >> >
>> > > > >> >
>> > > > >> >
>> > > > >> > On Mon, Jun 17, 2019 at 2:09 PM Vladimir Ralev <[email protected]> wrote:
>> > > > >> >
>> > > > >> >> Alright, I think the issue will be more visible towards the OOM
>> > > > >> >> point. However, for now, since you have the system live with a
>> > > > >> >> leak, it will be useful to repeat the same steps but replace
>> > > > >> >> "message_queue_len" with "total_heap_size", then with
>> > > > >> >> "heap_size", then with "stack_size", and then with "reductions".
>> > > > >> >>
>> > > > >> >> For example:
>> > > > >> >>
>> > > > >> >> MQSizes2 = lists:map(fun(A) ->
>> > > > >> >>     {_, B} = case process_info(A, total_heap_size) of
>> > > > >> >>         {XV, XB} -> {XV, XB};
>> > > > >> >>         _ERR -> io:format("~p", [_ERR]), {ok, 0}
>> > > > >> >>     end,
>> > > > >> >>     {B, A}
>> > > > >> >> end, processes()).
>> > > > >> >>
>> > > > >> >> Then the same with the other params.
>> > > > >> >>
>> > > > >> >> That can shed some light; otherwise someone will need to monitor
>> > > > >> >> the process count and go through the processes by age and memory
>> > > > >> >> patterns.
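>> > > > >> >>
>> > > > >> >> (For the process count itself, a periodic snapshot is a
>> > > > >> >> one-liner; sketch, piped through remsh like the other snippets
>> > > > >> >> in this thread:)
>> > > > >> >>
>> > > > >> >>     /opt/couchdb/bin/remsh <<'EOF'
>> > > > >> >>     erlang:system_info(process_count).
>> > > > >> >>     EOF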
>> > > > >> >>
>> > > > >> >> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé <[email protected]> wrote:
>> > > > >> >>
>> > > > >> >>> The 2 GB consumption is from Adrien's system.
>> > > > >> >>>
>> > > > >> >>> On mine, since I set up the logging of the
>> > > > >> >>> "/_node/_local/_system" output:
>> > > > >> >>> - on June 14th, max memory.processes was 2.6 GB
>> > > > >> >>> - on June 15th, max memory.processes was 4.7 GB
>> > > > >> >>> - on June 16th, max memory.processes was 7.0 GB
>> > > > >> >>> - today (June 17th), max memory.processes was 8.0 GB (and with
>> > > > >> >>>   an interactive top I see spikes at 12 GB)
>> > > > >> >>>
>> > > > >> >>> memory.processes seems to be steadily increasing over the days,
>> > > > >> >>> and I expect the out-of-memory condition to be triggered in a
>> > > > >> >>> couple of days.
>> > > > >> >>>
>> > > > >> >>> On Mon, Jun 17, 2019 at 11:53 AM Vladimir Ralev <[email protected]> wrote:
>> > > > >> >>>
>> > > > >> >>>> Nothing to see here; the message queue stat from Adam's advice
>> > > > >> >>>> is accurate. Note that you should run this only when there is
>> > > > >> >>>> already an unreasonable amount of memory leaked/consumed.
>> > > > >> >>>>
>> > > > >> >>>> But now I realise you had "processes":1877591424 before
>> > > > >> >>>> restart in the stats above, which is less than 2 GB. Are you
>> > > > >> >>>> using only 2 GB of RAM? I got confused by the initial comment
>> > > > >> >>>> and thought you had 15 GB of RAM. If you are only using 2 GB
>> > > > >> >>>> of RAM, it's probably not enough for your workload.
>> > > > >> >>>>
>> > > > >> >>>> On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé <[email protected]> wrote:
>> > > > >> >>>>
>> > > > >> >>>>> That command seems to work, and here is the output:
>> > > > >> >>>>>
>> > > > >> >>>>> --8<--
>> > > > >> >>>>> # /opt/couchdb/bin/remsh < debug.2.remsh
>> > > > >> >>>>> Eshell V7.3  (abort with ^G)
>> > > > >> >>>>> ([email protected])1> [{0,<0.0.0>},
>> > > > >> >>>>> {0,<0.3.0>},
>> > > > >> >>>>> {0,<0.6.0>},
>> > > > >> >>>>> {0,<0.7.0>},
>> > > > >> >>>>> {0,<0.9.0>},
>> > > > >> >>>>> {0,<0.10.0>},
>> > > > >> >>>>> {0,<0.11.0>},
>> > > > >> >>>>> {0,<0.12.0>},
>> > > > >> >>>>> {0,<0.14.0>},
>> > > > >> >>>>> {0,<0.15.0>},
>> > > > >> >>>>> {0,<0.16.0>},
>> > > > >> >>>>> {0,<0.17.0>},
>> > > > >> >>>>> {0,<0.18.0>},
>> > > > >> >>>>> {0,<0.19.0>},
>> > > > >> >>>>> {0,<0.20.0>},
>> > > > >> >>>>> {0,<0.21.0>},
>> > > > >> >>>>> {0,<0.22.0>},
>> > > > >> >>>>> {0,<0.23.0>},
>> > > > >> >>>>> {0,<0.24.0>},
>> > > > >> >>>>> {0,<0.25.0>},
>> > > > >> >>>>> {0,<0.26.0>},
>> > > > >> >>>>> {0,<0.27.0>},
>> > > > >> >>>>> {0,<0.28.0>},
>> > > > >> >>>>> {0,<0.29.0>},
>> > > > >> >>>>> {0,<0.31.0>},
>> > > > >> >>>>> {0,<0.32.0>},
>> > > > >> >>>>> {0,<0.33.0>},
>> > > > >> >>>>> {0,...},
>> > > > >> >>>>> {...}]
>> > > > >> >>>>> ([email protected])2> {0,<0.38.0>}
>> > > > >> >>>>> ([email protected])3>
>> > > > [{current_function,{erl_eval,do_apply,6}},
>> > > > >> >>>>> {initial_call,{erlang,apply,2}},
>> > > > >> >>>>> {status,running},
>> > > > >> >>>>> {message_queue_len,0},
>> > > > >> >>>>> {messages,[]},
>> > > > >> >>>>> {links,[<0.32.0>]},
>> > > > >> >>>>> {dictionary,[]},
>> > > > >> >>>>> {trap_exit,false},
>> > > > >> >>>>> {error_handler,error_handler},
>> > > > >> >>>>> {priority,normal},
>> > > > >> >>>>> {group_leader,<0.31.0>},
>> > > > >> >>>>> {total_heap_size,5172},
>> > > > >> >>>>> {heap_size,2586},
>> > > > >> >>>>> {stack_size,24},
>> > > > >> >>>>> {reductions,24496},
>> > > > >> >>>>> {garbage_collection,[{min_bin_vheap_size,46422},
>> > > > >> >>>>>                      {min_heap_size,233},
>> > > > >> >>>>>                      {fullsweep_after,65535},
>> > > > >> >>>>>                      {minor_gcs,1}]},
>> > > > >> >>>>> {suspending,[]}]
>> > > > >> >>>>> ([email protected])4> *** Terminating erlang ('
>> > > > >> >> [email protected]
>> > > > >> >>> ')
>> > > > >> >>>>> -->8--
>> > > > >> >>>>>
>> > > > >> >>>>> What should I be looking for in this output?
>> > > > >> >>>>>
>> > > > >> >>>>> On Fri, Jun 14, 2019 at 5:30 PM Vladimir Ralev <[email protected]> wrote:
>> > > > >> >>>>>
>> > > > >> >>>>>> That means your couch is creating and destroying processes
>> > > > >> >>>>>> too rapidly. I haven't seen this; however, I think Adam's
>> > > > >> >>>>>> message_queues stat above does the same thing. I didn't
>> > > > >> >>>>>> notice you could get it from there.
>> > > > >> >>>>>>
>> > > > >> >>>>>> Either way, it will be useful if you can get the shell to
>> > > > >> >>>>>> work. Try this command instead for the first one; the rest
>> > > > >> >>>>>> will be the same:
>> > > > >> >>>>>>
>> > > > >> >>>>>> MQSizes2 = lists:map(fun(A) ->
>> > > > >> >>>>>>     {_, B} = case process_info(A, message_queue_len) of
>> > > > >> >>>>>>         {XV, XB} -> {XV, XB};
>> > > > >> >>>>>>         _ERR -> io:format("~p", [_ERR]), {ok, 0}
>> > > > >> >>>>>>     end,
>> > > > >> >>>>>>     {B, A}
>> > > > >> >>>>>> end, processes()).
>> > > > >> >>>>>>
>> > > > >> >>>>>> On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé <[email protected]> wrote:
>> > > > >> >>>>>>
>> > > > >> >>>>>>> I tried the following, but it seems to fail on the first
>> > > > >> >>>>>>> command:
>> > > > >> >>>>>>>
>> > > > >> >>>>>>> --8<--
>> > > > >> >>>>>>> # /opt/couchdb/bin/remsh
>> > > > >> >>>>>>> Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
>> > > > >> >>>>>>> [async-threads:10] [hipe] [kernel-poll:false]
>> > > > >> >>>>>>>
>> > > > >> >>>>>>> Eshell V7.3  (abort with ^G)
>> > > > >> >>>>>>> ([email protected])1> MQSizes2 = lists:map(fun(A) -> {_,B} =
>> > > > >> >>>>>>> process_info(A,message_queue_len), {B,A} end, processes()).
>> > > > >> >>>>>>> ** exception error: no match of right hand side value undefined
>> > > > >> >>>>>>> -->8--
>> > > > >> >>>>>>>
>> > > > >> >>>>>>>
>> > > > >> >>>>>>> On Fri, Jun 14, 2019 at 4:08 PM Vladimir Ralev <[email protected]> wrote:
>> > > > >> >>>>>>>
>> > > > >> >>>>>>>> Hey guys. I bet it's a mailbox leaking memory. I am very
>> > > > >> >>>>>>>> interested in debugging issues like this too.
>> > > > >> >>>>>>>>
>> > > > >> >>>>>>>> I suggest getting an Erlang shell and running these
>> > > > >> >>>>>>>> commands to see the top memory-consuming processes:
>> > > > >> >>>>>>>> https://www.mail-archive.com/[email protected]/msg29365.html
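>> > > > >> >>>>>>>>
>> > > > >> >>>>>>>> One way to list the top memory consumers is to sort
>> > > > >> >>>>>>>> processes() by their reported memory; a self-contained
>> > > > >> >>>>>>>> sketch, fed to remsh with the same caveats as the other
>> > > > >> >>>>>>>> snippets in this thread:
>> > > > >> >>>>>>>>
>> > > > >> >>>>>>>>     /opt/couchdb/bin/remsh <<'EOF'
>> > > > >> >>>>>>>>     Mem = [{M, A, process_info(A, registered_name)}
>> > > > >> >>>>>>>>            || A <- processes(),
>> > > > >> >>>>>>>>               {memory, M} <- [process_info(A, memory)]],
>> > > > >> >>>>>>>>     lists:sublist(lists:reverse(lists:sort(Mem)), 10).
>> > > > >> >>>>>>>>     EOF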
>> > > > >> >>>>>>>>
>> > > > >> >>>>>>>> One issue I will be reporting soon: if one of your nodes
>> > > > >> >>>>>>>> is down for some amount of time, it seems like all
>> > > > >> >>>>>>>> databases independently try and retry to query the missing
>> > > > >> >>>>>>>> node and fail, resulting in a lot of logs printed for each
>> > > > >> >>>>>>>> db, which can overwhelm the logger process. If you have a
>> > > > >> >>>>>>>> lot of DBs this makes the problem worse, but it doesn't
>> > > > >> >>>>>>>> happen right away for some reason.
>> > > > >> >>>>>>>>
>> > > > >> >>>>>>>> On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <[email protected]> wrote:
>> > > > >> >>>>>>>>
>> > > > >> >>>>>>>>> Hi Jérôme and Adam,
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> That's funny, because I'm investigating the exact same
>> > > > >> >>>>>>>>> problem these days. We have two CouchDB setups:
>> > > > >> >>>>>>>>> - a one-node server (q=2 n=1) with 5000 databases
>> > > > >> >>>>>>>>> - a 3-node cluster (q=2 n=3) with 50000 databases
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> ... and we are experiencing the problem on both setups.
>> > > > >> >>>>>>>>> We've been having this problem for at least 3-4 months.
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> We've monitored:
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> - The number of open files: it's relatively low (both the
>> > > > >> >>>>>>>>>   system's total and the fds opened by beam.smp).
>> > > > >> >>>>>>>>>   https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> - The RAM usage, total used and used by beam.smp:
>> > > > >> >>>>>>>>>   https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
>> > > > >> >>>>>>>>>   It continuously grows, with regular spikes, until
>> > > > >> >>>>>>>>>   CouchDB is killed by an OOM. After restart, the RAM
>> > > > >> >>>>>>>>>   usage is nice and low, with no spikes.
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> - /_node/_local/_system metrics, before and after restart.
>> > > > >> >>>>>>>>>   Values that significantly differ (before / after
>> > > > >> >>>>>>>>>   restart) are listed here:
>> > > > >> >>>>>>>>>   - uptime (obviously ;-))
>> > > > >> >>>>>>>>>   - memory.processes : + 3732 %
>> > > > >> >>>>>>>>>   - memory.processes_used : + 3735 %
>> > > > >> >>>>>>>>>   - memory.binary : + 17700 %
>> > > > >> >>>>>>>>>   - context_switches : + 17376 %
>> > > > >> >>>>>>>>>   - reductions : + 867832 %
>> > > > >> >>>>>>>>>   - garbage_collection_count : + 448248 %
>> > > > >> >>>>>>>>>   - words_reclaimed : + 112755 %
>> > > > >> >>>>>>>>>   - io_input : + 44226 %
>> > > > >> >>>>>>>>>   - io_output : + 157951 %
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> Before CouchDB restart:
>> > > > >> >>>>>>>>> {
>> > > > >> >>>>>>>>>  "uptime":2712973,
>> > > > >> >>>>>>>>>  "memory":{
>> > > > >> >>>>>>>>>    "other":7250289,
>> > > > >> >>>>>>>>>    "atom":512625,
>> > > > >> >>>>>>>>>    "atom_used":510002,
>> > > > >> >>>>>>>>>    "processes":1877591424,
>> > > > >> >>>>>>>>>    "processes_used":1877504920,
>> > > > >> >>>>>>>>>    "binary":177468848,
>> > > > >> >>>>>>>>>    "code":9653286,
>> > > > >> >>>>>>>>>    "ets":16012736
>> > > > >> >>>>>>>>>  },
>> > > > >> >>>>>>>>>  "run_queue":0,
>> > > > >> >>>>>>>>>  "ets_table_count":102,
>> > > > >> >>>>>>>>>  "context_switches":1621495509,
>> > > > >> >>>>>>>>>  "reductions":968705947589,
>> > > > >> >>>>>>>>>  "garbage_collection_count":331826928,
>> > > > >> >>>>>>>>>  "words_reclaimed":269964293572,
>> > > > >> >>>>>>>>>  "io_input":8812455,
>> > > > >> >>>>>>>>>  "io_output":20733066,
>> > > > >> >>>>>>>>>  ...
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> After CouchDB restart:
>> > > > >> >>>>>>>>> {
>> > > > >> >>>>>>>>>  "uptime":206,
>> > > > >> >>>>>>>>>  "memory":{
>> > > > >> >>>>>>>>>    "other":6907493,
>> > > > >> >>>>>>>>>    "atom":512625,
>> > > > >> >>>>>>>>>    "atom_used":497769,
>> > > > >> >>>>>>>>>    "processes":49001944,
>> > > > >> >>>>>>>>>    "processes_used":48963168,
>> > > > >> >>>>>>>>>    "binary":997032,
>> > > > >> >>>>>>>>>    "code":9233842,
>> > > > >> >>>>>>>>>    "ets":4779576
>> > > > >> >>>>>>>>>  },
>> > > > >> >>>>>>>>>  "run_queue":0,
>> > > > >> >>>>>>>>>  "ets_table_count":102,
>> > > > >> >>>>>>>>>  "context_switches":1015486,
>> > > > >> >>>>>>>>>  "reductions":111610788,
>> > > > >> >>>>>>>>>  "garbage_collection_count":74011,
>> > > > >> >>>>>>>>>  "words_reclaimed":239214127,
>> > > > >> >>>>>>>>>  "io_input":19881,
>> > > > >> >>>>>>>>>  "io_output":13118,
>> > > > >> >>>>>>>>>  ...
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> Adrien
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>> On Fri, Jun 14, 2019 at 3:11 PM Jérôme Augé <[email protected]> wrote:
>> > > > >> >>>>>>>>>
>> > > > >> >>>>>>>>>> Ok, so I'll set up a cron job to log (every minute?) the
>> > > > >> >>>>>>>>>> output of "/_node/_local/_system" and wait for the next
>> > > > >> >>>>>>>>>> OOM kill.
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>> Any property from "_system" to look for in particular?
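>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>> (A crontab entry for that could look like this sketch;
>> > > > >> >>>>>>>>>> credentials, host and log path are placeholders:)
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>>     # append a timestamped _system snapshot every minute
>> > > > >> >>>>>>>>>>     * * * * * (date -Is; curl -su admin:password http://127.0.0.1:5984/_node/_local/_system) >> /var/log/couchdb-system.log 2>&1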
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>> Here is a link to the memory usage graph:
>> > > > >> >>>>>>>>>> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>> The memory usage varies, but the general trend is to go
>> > > > >> >>>>>>>>>> up with some regularity over a week until we reach OOM.
>> > > > >> >>>>>>>>>> When "beam.smp" is killed, it's reported as consuming
>> > > > >> >>>>>>>>>> 15 GB (as seen in the kernel's OOM trace in syslog).
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>> Thanks,
>> > > > >> >>>>>>>>>> Jérôme
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>> On Fri, Jun 14, 2019 at 1:48 PM Adam Kocoloski <[email protected]> wrote:
>> > > > >> >>>>>>>>>>
>> > > > >> >>>>>>>>>>> Hi Jérôme,
>> > > > >> >>>>>>>>>>>
>> > > > >> >>>>>>>>>>> Thanks for a well-written and detailed report (though
>> > > > >> >>>>>>>>>>> the mailing list strips attachments). The _system
>> > > > >> >>>>>>>>>>> endpoint provides a lot of useful data for debugging
>> > > > >> >>>>>>>>>>> these kinds of situations; do you have a snapshot of the
>> > > > >> >>>>>>>>>>> output when the system was consuming a lot of memory?
>> > > > >> >>>>>>>>>>>
>> > > > >> >>>>>>>>>>> http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
>> > > > >> >>>>>>>>>>>
>> > > > >> >>>>>>>>>>> Adam
>> > > > >> >>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé <[email protected]> wrote:
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> Hi,
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> I'm having a hard time figuring out the high memory
>> > > > >> >>>>>>>>>>>> usage of a CouchDB server.
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> What I'm observing is that the memory consumption of
>> > > > >> >>>>>>>>>>>> the "beam.smp" process gradually rises until it
>> > > > >> >>>>>>>>>>>> triggers the kernel's OOM (Out-Of-Memory) killer, which
>> > > > >> >>>>>>>>>>>> kills the "beam.smp" process.
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> It also seems that many databases are not compacted:
>> > > > >> >>>>>>>>>>>> I've made a script that iterates over the databases to
>> > > > >> >>>>>>>>>>>> compute the fragmentation factor, and it seems I have
>> > > > >> >>>>>>>>>>>> around 2100 databases with a frag > 70%.
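>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> (For illustration, such a script could be roughly of
>> > > > >> >>>>>>>>>>>> this shape; sketch assuming curl and jq, placeholder
>> > > > >> >>>>>>>>>>>> credentials, and frag computed as (file - active) /
>> > > > >> >>>>>>>>>>>> file from each database's reported sizes:)
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>>     #!/bin/sh
>> > > > >> >>>>>>>>>>>>     # print databases whose fragmentation exceeds 70%
>> > > > >> >>>>>>>>>>>>     HOST=http://admin:password@127.0.0.1:5984
>> > > > >> >>>>>>>>>>>>     curl -s "$HOST/_all_dbs" | jq -r '.[]' | while read -r db; do
>> > > > >> >>>>>>>>>>>>       curl -s "$HOST/$db" | jq -r --arg db "$db" \
>> > > > >> >>>>>>>>>>>>         'select(.sizes.file > 0)
>> > > > >> >>>>>>>>>>>>          | ((.sizes.file - .sizes.active) / .sizes.file * 100)
>> > > > >> >>>>>>>>>>>>          | select(. > 70) | "\($db) \(. | floor)%"'
>> > > > >> >>>>>>>>>>>>     done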
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> We have a single CouchDB v2.1.1 server (configured
>> > > > >> >>>>>>>>>>>> with q=8 n=1) and around 2770 databases.
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> The server initially had 4 GB of RAM, and we are now at
>> > > > >> >>>>>>>>>>>> 16 GB with 8 vCPUs, and it still regularly reaches OOM.
>> > > > >> >>>>>>>>>>>> From the monitoring I see that with 16 GB the OOM is
>> > > > >> >>>>>>>>>>>> triggered almost once per week (c.f. attached graph).
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> The memory usage seems to increase gradually until it
>> > > > >> >>>>>>>>>>>> reaches OOM.
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> The Couch server is mostly used by web clients with the
>> > > > >> >>>>>>>>>>>> PouchDB JS API.
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> We have ~1300 distinct users, and by monitoring the
>> > > > >> >>>>>>>>>>>> netstat/TCP established connections I guess we have
>> > > > >> >>>>>>>>>>>> around 100 users (maximum) at any given time. From what
>> > > > >> >>>>>>>>>>>> I understand of the application's logic, each user
>> > > > >> >>>>>>>>>>>> accesses 2 private databases (read/write) + 1 common
>> > > > >> >>>>>>>>>>>> database (read-only).
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> On-disk usage of CouchDB's data directory is around
>> > > > >> >>>>>>>>>>>> 40 GB.
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> Any ideas on what could cause such behavior (increasing
>> > > > >> >>>>>>>>>>>> memory usage over the course of a week)? Or how to find
>> > > > >> >>>>>>>>>>>> what is happening behind the scenes?
>> > > > >> >>>>>>>>>>>>
>> > > > >> >>>>>>>>>>>> Regards,
>> > > > >> >>>>>>>>>>>> Jérôme
