On Sat, Sep 14, 2013 at 11:00:49AM +0100, Robert Newson wrote:
>
> We should really remove that init.d daemon script and replace it with runit.
> That way you a) are guaranteed a restart on crash and b) stdout/err is
> automatically captured (and rotated). In my experience the stdout/err in
> these events is very useful. To switch, you need runit (obviously) and then a
> short stanza that starts couchdb in the foreground; there's a switch for
> that. Alternatively, start in the foreground in a terminal (as the couchdb
> user) and pound the server until it crashes.
>
> I've no operational experience with R16 series, unfortunately. All I do know
> is, since R15, the new process scheduler can interact poorly with NIFs that
> perform work lasting over a millisecond, which I could imagine happening for
> JSON encoding/decoding of large documents.
>
> If it were a running out of file descriptors or sockets situation, I would
> expect some useful noise in the log, but we can't rule it out yet.
I just downgraded to Erlang R15B03, but it hasn't been running long
enough to crash yet.
The stdout and stderr from the last crash (with Erlang R16) are a little
interesting.
stderr:
heart_beat_kill_pid = 514
heart_beat_timeout = 11
heart: Fri Sep 13 20:59:36 2013: heart-beat time-out, no activity for 15
seconds
heart: Fri Sep 13 20:59:37 2013: Executed "/usr/bin/couchdb -k" -> 0.
Terminating.
heart_beat_kill_pid = 954
heart_beat_timeout = 11
heart: Sat Sep 14 00:14:20 2013: heart-beat time-out, no activity for 15
seconds
heart: Sat Sep 14 00:14:21 2013: Executed "/usr/bin/couchdb -k" -> 0.
Terminating.
heart_beat_kill_pid = 12293
heart_beat_timeout = 11
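To line these time-outs up against the CouchDB log, a small Python
sketch along the following lines could pull the timestamps out of the
heart output. It assumes the line format shown above and English day
and month names; the function name heart_timeouts is made up for
illustration and is not part of CouchDB or Erlang.

```python
import re
from datetime import datetime

# Matches lines such as:
#   heart: Fri Sep 13 20:59:36 2013: heart-beat time-out, no activity for 15 seconds
HEART_TIMEOUT = re.compile(
    r"heart: (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}): "
    r"heart-beat time-out, no activity for (\d+)\s+seconds"
)


def heart_timeouts(stderr_text):
    """Return (timestamp, idle_seconds) for each heart time-out in the text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(stderr_text.split())
    events = []
    for stamp, idle in HEART_TIMEOUT.findall(flat):
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        events.append((when, int(idle)))
    return events
```

Feeding it the captured stderr gives one entry per time-out, which
makes it easier to compare against the last request in the debug log.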
stdout looks like
{error_logger,{{2013,9,13},{21,11,2}},std_error,"File operation error: eacces.
Target: /lost+found/ebin. Function: read_file_info. Process: code_server."}
{error_logger,{{2013,9,13},{21,11,2}},std_error,"File operation error: eacces.
Target: /root/ebin. Function: read_file_info. Process: code_server."}
=ERROR REPORT==== 13-Sep-2013::14:11:02 ===
File operation error: eacces. Target: /lost+found/ebin. Function:
read_file_info. Process: code_server.
=ERROR REPORT==== 13-Sep-2013::14:11:02 ===
File operation error: eacces. Target: /root/ebin. Function: read_file_info.
Process: code_server.
Apache CouchDB 1.4.0 (LogLevel=warn) is starting.
Apache CouchDB has started. Time to relax.
{error_logger,{{2013,9,14},{13,14,44}},std_error,"File operation error:
eacces. Target: /lost+found/ebin. Function: read_file_info. Process:
code_server."}
{error_logger,{{2013,9,14},{13,14,44}},std_error,"File operation error:
eacces. Target: /root/ebin. Function: read_file_info. Process: code_server."}
=ERROR REPORT==== 14-Sep-2013::06:14:44 ===
File operation error: eacces. Target: /lost+found/ebin. Function:
read_file_info. Process: code_server.
=ERROR REPORT==== 14-Sep-2013::06:14:44 ===
File operation error: eacces. Target: /root/ebin. Function: read_file_info.
Process: code_server.
Apache CouchDB 1.4.0 (LogLevel=warn) is starting.
Apache CouchDB has started. Time to relax.
Which actually looks kind of interesting, but I have no idea why those
files would be missing or even needed.
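For what it is worth, eacces is the POSIX "permission denied" error
rather than "no such file" (that would be enoent), so the code server
was apparently refused access while looking at /root/ebin and
/lost+found/ebin. Why it probes those paths at all is not clear from
the output; one guess (an assumption, not confirmed here) is that some
library search is rooted at /. A rough Python sketch to repeat the same
kind of probe, meant to be run as the couchdb user (probe_ebin_dirs is
a hypothetical helper name):

```python
import errno
import os


def probe_ebin_dirs(root="/"):
    """Stat <root>/<entry>/ebin for every entry and report the outcome."""
    results = {}
    for entry in sorted(os.listdir(root)):
        candidate = os.path.join(root, entry, "ebin")
        try:
            os.stat(candidate)
            results[candidate] = "ok"
        except OSError as exc:
            # Map the numeric errno to its symbolic name, e.g. EACCES or ENOENT.
            results[candidate] = errno.errorcode.get(exc.errno, str(exc.errno))
    return results
```

Entries reported as EACCES would correspond to the eacces lines above;
ENOENT entries simply have no ebin directory.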
I'll post again after running for a while with downgraded erlang,
hopefully to say problem solved.
Is runit like node.js's forever?
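As a rough illustration of the behaviour described in the quoted text
(a guaranteed restart on crash, with stdout/stderr captured), here is a
minimal Python sketch of a supervise-and-restart loop. It is not how
runit is implemented and it does no log rotation; the function name,
the max_restarts and delay parameters, and the log path in the usage
note are all assumptions for illustration.

```python
import subprocess
import time


def supervise(cmd, log_path, max_restarts=None, delay=1.0):
    """Run cmd in the foreground and restart it whenever it exits.

    stdout and stderr of the child are appended to log_path.
    Returns the list of exit codes seen (only reachable when
    max_restarts is set; otherwise the loop runs forever).
    """
    exit_codes = []
    with open(log_path, "ab") as log:
        while True:
            proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            exit_codes.append(proc.wait())
            if max_restarts is not None and len(exit_codes) > max_restarts:
                break
            log.write(b"supervise: exited with %d, restarting\n" % exit_codes[-1])
            log.flush()
            time.sleep(delay)
    return exit_codes
```

Something like supervise(["/usr/bin/couchdb"], "/tmp/couchdb-stdout.log")
would be the idea, assuming that invocation keeps CouchDB in the
foreground; the exact switch should be checked against the couchdb
script itself.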
Regards,
James Marca
>
> B.
>
>
> On 13 Sep 2013, at 23:20, James Marca <[email protected]> wrote:
>
> > I am seeing a lot of random, silent crashes on just *one* of my
> > CouchDB servers.
> >
> > couchdb version 1.4.0 (gentoo ebuild)
> >
> > erlang also from gentoo ebuild:
> > Erlang (BEAM) emulator version 5.10.2
> > Compiled on Fri Sep 13 08:39:20 2013
> > Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> > [async-threads:10] [kernel-poll:false]
> >
> > I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> > All of them are replicating a single db between them, with B acting as
> > the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> > B.
> >
> > All three servers have data crunching jobs running that are reading
> > and writing to the database that is being replicated around.
> >
> > The B server, the one in the middle that is push replicating to both A
> > and C, is the one that is crashing.
> >
> > The log looks like this:
> >
> > [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET
> > /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> > [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET
> > /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> > [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has
> > started on http://0.0.0.0:5984/
> > [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> > replication `84213867ea04ca187d64dbf447660e52+continuous+create_target`
> > (document `carb_grid_state4k_push_emma64`).
> > [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> > replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document
> > `carb_grid_state5k_hpms_push_kitty`).
> >
> > no warning that the server died or why, and nothing in the
> > /var/log/messages about anything untoward happening (no OOM killer
> > invoked or anything like that)
> >
> > The restart only happened because I manually did a
> > /etc/init.d/couchdb restart
> > Usually couchdb restarts itself, but not with this crash.
> >
> >
> >
> > I flipped the log to debug level, and still had no warning about the crash:
> >
> > [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST'
> > /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> > Headers: [{'Accept',"application/json"},
> > {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
> > {'Content-Length',"346"},
> > {'Content-Type',"application/json"},
> > {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
> > {'User-Agent',"CouchDB/1.4.0"},
> > {"X-Couch-Full-Commit","false"}]
> > [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> > [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc
> > batch of size 128531 bytes
> >
> > And that was it. CouchDB was down and out.
> >
> > I even tried shutting off the data processing (so as to reduce the db
> > load) on box B, but that didn't help (all the crashing has put it far
> > behind in replicating box A and C).
> >
> > My guess is that the replication load is too big (too many
> > connections, too much data being pushed in), but I would expect some
> > sort of warning before the server dies.
> >
> > Any clues or suggestions would be appreciated. I am currently going
> > to try compiling from source directly, but I don't have much faith that
> > it will make a difference.
> >
> > Thanks,
> > James Marca
> >
> > --
> > This message has been scanned for viruses and
> > dangerous content by MailScanner, and is
> > believed to be clean.
> >
>