Hi Carlos, I wrote a post on monitoring CouchDB using Prometheus:
https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0

I’m not sure if it will provide all the metrics you need, but I hope this helps.
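
In case it's useful, the raw per-node numbers those dashboards are built on can
also be checked directly. A minimal sketch (host, credentials and node name are
placeholders; on 2.1 the endpoint is typically /_node/<node-name>/_system):

    curl -s http://admin:secret@127.0.0.1:5984/_node/couchdb@couchdb-node-1/_system

That returns a JSON document with Erlang VM counters such as process_count,
memory usage and message queue sizes, which is handy for spotting the kind of
per-node drift described further down the thread.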

Geoff
On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso <[email protected]>
wrote:

> I'd like to connect a diagnostic tool such as etop, observer, ... to see
> which processes are running there, but I can't seem to get it working.
>
> Could anyone please share how to run any of those tools on a remote server?
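>
> In case it helps frame the question, this is the kind of invocation I mean
> (the node name, IP and cookie below are placeholders; the real cookie is the
> -setcookie value in the server's vm.args):
>
>     erl -name debug@203.0.113.10 -setcookie monster -hidden \
>         -remsh couchdb@couchdb-node-1
>
>     # then, in the remote shell, for example:
>     #   erlang:system_info(process_count).
>     #   erlang:memory().
>
> My understanding is that observer:start() would have to run from a local node
> with a GUI that joins the cluster with the same cookie, rather than from
> inside the remsh itself.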
>
> Regards
>
> On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso <[email protected]>
> wrote:
>
> > So, I've found another relevant symptom. After adding _system endpoint
> > monitoring I have discovered that the particular node behaves differently
> > from the other ones in terms of Erlang process count.
> >
> > The process_count metric of the normal nodes is stable at around 1k to
> > 1.3k, while the affected node's process_count grows slowly but continuously
> > until a little above 5k processes, which is when it gets 'frozen'. After
> > restarting, the value comes back to the normal 1k to 1.3k (and immediately
> > starts slowly growing again, of course :)).
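> >
> > Roughly, the sampling amounts to something like this (host, credentials and
> > node name are placeholders):
> >
> >     while true; do
> >       curl -s http://admin:secret@127.0.0.1:5984/_node/couchdb@couchdb-node-1/_system \
> >         | jq '{process_count, process_limit}'
> >       sleep 60
> >     done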
> >
> > Any idea? Thanks!
> >
> > On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <[email protected]>
> > wrote:
> >
> >> This is one of the complete error sequences I can see:
> >>
> >> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
> >> -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1'
> >> with exit value:
> >> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
> >>
> >> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204>
> >> aab326c0bb req_err(2515771787) badmatch : ok
> >>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> >>
> >> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207>
> >> -------- Replicator, request PUT to "http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false"
> >> failed due to error {error,
> >>     {'EXIT',
> >>         {{{nocatch,{mp_parser_died,noproc}},
> >> ...
> >>
> >> Regards
> >>
> >> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <[email protected]> wrote:
> >>
> >>> The 'weird' thing about the mp_parser_died error is that, according to
> >>> the description of issue 745, the replication never finishes because the
> >>> item that fails once seems to fail forever. In my case, however, the items
> >>> fail but then seem to work (possibly because the replication is retried),
> >>> as I can find the documents that generated the errors (in the logs) in the
> >>> target db...
> >>>
> >>> Regards
> >>>
> >>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <[email protected]> wrote:
> >>>
> >>>> So, to give some more context, this node is responsible for replicating
> >>>> a database that has quite a few attachments, and it raises the 'famous'
> >>>> mp_parser_died,noproc error, which I think is this one:
> >>>> https://github.com/apache/couchdb/issues/745
> >>>>
> >>>> What I've identified so far from the logs is that, along with the error
> >>>> described above, this error also appears:
> >>>>
> >>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
> >>>> <0.30012.3408> 520e44b7ae req_err(2515771787) badmatch : ok
> >>>>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> >>>>
> >>>> Sometimes it appears just after the mp_parser_died error; sometimes the
> >>>> parser error happens without 'triggering' one of these badmatch errors.
> >>>>
> >>>> Then, after a while of this sequence, the initially described
> >>>> sel_conn_closed error starts being raised for all requests and the node
> >>>> gets frozen. It is not responsive, but it is still not removed from the
> >>>> cluster, so it holds on to its replications and, obviously, replicates
> >>>> nothing until it is restarted.
> >>>>
> >>>> I can also see interleaved unauthorized errors, which don't make much
> >>>> sense, as I'm the only one accessing this cluster:
> >>>>
> >>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
> >>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are not authorized to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> >>>>
> >>>>
> >>>> To me, it feels like the mp_parser_died error slowly breaks something
> >>>> that eventually leaves the node unresponsive, as those errors happen a
> >>>> lot in that particular replication.
> >>>>
> >>>> Regards and thanks a lot for your help!
> >>>>
> >>>>
> >>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet <[email protected]> wrote:
> >>>>
> >>>>> Is there more to the error? All this shows us is that the replicator
> >>>>> itself attempted a POST and had the connection closed on it. (Remember
> >>>>> that the replicator is basically just a custom client that sits
> >>>>> alongside CouchDB on the same machine.) There should be more to the
> >>>>> error log that shows why CouchDB hung up the phone.
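> >>>>>
> >>>>> For example, something along these lines against the log on the frozen
> >>>>> node, around one of the failing checkpoints, should show the surrounding
> >>>>> context (the log path is a guess and depends on how CouchDB was
> >>>>> installed):
> >>>>>
> >>>>>     grep -B 5 -A 20 'sel_conn_closed' /opt/couchdb/var/log/couchdb.log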
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: "Carlos Alonso" <[email protected]>
> >>>>> To: "user" <[email protected]>
> >>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
> >>>>> Subject: Re: Trying to understand why a node gets 'frozen'
> >>>>>
> >>>>> Hello, this is happening every day, always on the same node. Any ideas?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <
> >>>>> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>> > Hello everyone!!
> >>>>> >
> >>>>> > I'm trying to understand an issue we're experiencing on CouchDB 2.1.0
> >>>>> > running on Ubuntu 14.04. The cluster itself is currently replicating
> >>>>> > from another source cluster, and we have seen that one node gets
> >>>>> > frozen from time to time and has to be restarted to get it to respond
> >>>>> > again.
> >>>>> >
> >>>>> > Before getting unresponsive, the node throws a lot of {error,
> >>>>> > sel_conn_closed}. See an example trace below.
> >>>>> >
> >>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0>
> >>>>> > -------- gen_server <0.13489.0> terminated with reason:
> >>>>> > {checkpoint_commit_failure,<<"Failure on target commit:
> >>>>> > {'EXIT',{http_request_failed,\"POST\",\n          \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n          {error,sel_conn_closed}}}">>}
> >>>>> >   last msg: {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\",\n          \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n          {error,sel_conn_closed}}}">>}}
> >>>>> >      state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"https://source_ip/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"http://127.0.0.1:5984/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
> >>>>> >
> >>>>> > The particular node is 'responsible' for a replication that has quite
> >>>>> > a few {mp_parser_died,noproc} errors, which AFAIK is a known bug
> >>>>> > (https://github.com/apache/couchdb/issues/745), but I don't know if
> >>>>> > that may be related.
> >>>>> >
> >>>>> > When that happens, just restarting the node brings it back up and
> >>>>> > running properly.
> >>>>> >
> >>>>> > Any help would be really appreciated.
> >>>>> >
> >>>>> > Regards
